Preface


Why Should You Read This Book? 


While this book was being written, in the year 2020, so-called smart and intelligent
systems were becoming available in increasing numbers. Such systems use computers
and other forms of information and communication technology (ICT) to provide
services to humans, partially employing various kinds of artificial intelligence 
(AI). For example, recently introduced cars are increasingly capable of driving 
autonomously. In avionics and rail-based transportation, driver-less transportation 
is already available or on the horizon. The power grid is becoming smarter and the 
same applies to buildings. All these systems are based on a combination of ICT 
and physical systems called cyber-physical systems (CPS). Such systems can be 
defined as “engineered systems that are built from and depend upon the synergy of 
computational and physical components” [412]. Due to the direct interface between 
the physical and the cyber-world, cyber-physical systems have to be dependable. 

The physical world also plays a key role in the definition of the related 
term “Internet of Things” (IoT), referring to the physical world as “things.” IoT 
“describes ... a variety of devices ... able to interact and cooperate with each
other to reach common goals” [185]. Examples of IoT applications include sensor 
networks or e-bikes that can be collected again thanks to available GPS information.

Both terms, CPS and IoT, generalize and extend the earlier term
“embedded systems” (ES). Embedded systems are information processing systems 
that are embedded into an enclosing product [371]. Compared to the term “embed- 
ded systems,” the terms CPS and IoT place more emphasis on physical objects, e.g., 
cars, airplanes, or smart devices. 

The steep rise in the availability of embedded and, correspondingly, also cyber- 
physical systems was already predicted in 2001: “Information technology (IT) is 
on the verge of another revolution. ... networked systems of embedded computers 
... have the potential to change radically the way people interact with their 
environment by linking together a range of devices and sensors that will allow 
information to be collected, shared, and processed in unprecedented ways. ... The 
use ... throughout society could well dwarf previous milestones in the information
revolution.” This citation from a report of the National Research Council in the 
USA [410] describes very nicely the dramatic impact of information technology 
in embedded systems. This revolution has already had a major impact and is still 
continuing. 

Terms like pervasive and ubiquitous computing, ambient intelligence, and
“Industry 4.0” also refer to the dramatic impact of changes caused by
information technology.

This importance of embedded/cyber-physical systems and IoT is so far not 
well reflected in many of the current curricula. However, designing the mentioned 
systems requires interdisciplinary knowledge and skills beyond the traditional 
boundaries of disciplines. Obtaining an overview of such broad knowledge is very 
difficult, due to the wide range of relevant areas. This book aims at facilitating the 
acquisition of knowledge from a kernel of relevant areas. It is already a challenge 
to identify the kernel of this knowledge. The book aims at being a remedy in this 
situation. It provides material for a first course on such systems and includes an 
overview of key concepts for the integration of ICT with physical objects. It covers 
hardware as well as software aspects. This is in line with the ARTIST¹ guidelines
for curricula of embedded systems: “The development of embedded systems cannot
ignore the underlying hardware characteristics. Timing, memory usage, power
consumption, and physical failures are important” [85].

This book has been designed as a textbook. However, the book provides more 
references than typical textbooks do and also helps to structure the area. Hence, 
this book should also be useful for faculty members and engineers. For students, 
the inclusion of a rich set of references facilitates access to relevant sources of 
information. 

The book focuses on the fundamental bases of software and hardware. Specific 
products and tools are mentioned only if they have outstanding characteristics. 
Again, this is in line with the ARTIST guidelines: “It seems that fundamental bases
are really difficult to acquire during continuous training if they haven’t been initially 
learned, and we must focus on them” [85]. As a consequence, this book goes beyond 
teaching embedded system design by programming micro-controllers. The book 
presents the fundamentals of embedded systems design, which are needed for 
the design of CPS and IoT systems. With this approach, we would like to make 
sure that the material taught would not be outdated too soon. The concepts covered 
in this book should be relevant for a number of years to come. 

The proposed positioning of the current textbook in engineering curricula related 
to ICT is explained in a paper [372]. We want to relate the most important topics 
in this area to each other. This way, we avoid a problem mentioned in the ARTIST 
guidelines: “The lack of maturity of the domain results in a large variety of industrial 
practices, often due to cultural habits. ... curricula ... concentrate on one technique
and do not present a sufficiently wide perspective... As a result, industry has difficulty
finding adequately trained engineers, fully aware of design choices” [85].

¹ARTIST is the acronym of a European network of excellence for embedded systems (see
http://www.artist-embedded.org and http://www.emsig.net).

The book should also help to bridge the gap between practical experiences with 
programming micro-controllers and more theoretical issues. Furthermore, it should 
help to motivate students and teachers to look at more details. While the book covers 
a number of topics in detail, others are covered only briefly. These brief sections 
have been included in order to put a number of related issues into perspective. 
Furthermore, this approach allows lecturers to have appropriate links in the book for 
adding complementary material of their choice. Due to the rich set of references, the 
book can also be used as a comprehensive tutorial, providing pointers for additional 
reading. Such references can also encourage use of the book during labs,
projects, and independent studies, and they can serve as a starting point for research.

The scope of this book includes specification techniques, system software, 
application mapping, evaluation and validation, hardware components, and the 
interface between the cyber- and the physical world (the cyphy-interface) as well 
as exemplary optimizations and test methods. The book covers embedded systems 
and their interface to the physical environment from a wide perspective but cannot 
cover every related area. Legal and socio-economic aspects, human interfaces, data 
analysis, application-specific aspects, and a detailed presentation of physics and 
communication are beyond the scope of this book. The coverage of the Internet 
of Things is limited to areas linked to embedded systems. 


Who Should Read the Book? 


This book is intended for the following audience: 


• Computer science (CS), computer engineering (CE), and electrical engineering
(EE) students as well as students in other information and communication 
technology (ICT)-related areas who would like to specialize in embedded/cyber- 
physical systems or IoT. The book should be appropriate for third-year students 
who have a basic knowledge of computer hardware and software. This means
that the book primarily targets senior undergraduate students.² However, it can
also be used at the graduate level if embedded system design is not part of the 
undergraduate program or if the discussion of some topics is postponed. This 
book is intended to pave the way for more advanced topics that should be 
covered in follow-up courses. The book assumes a basic knowledge of computer 
science. EE students may have to read some additional material in order to fully 
understand the topics of this book. This should be compensated by the fact that 
some material covered in this book may already be known to EE students. 


²This is consistent with the curriculum described by T. Abdelzaher in a report on CPS education
[411].
• Engineers who have so far worked on system hardware and who have to move
more toward the software of embedded systems. This book should provide enough
background to understand the relevant technical publications. 

• PhD students who would like to get a quick, broad overview of key concepts in
embedded system technology before focusing on a specific research area. 

• Professors designing a new curriculum for the mentioned areas.


How Is This Book Different from Earlier Editions? 


The first edition of this book was published in 2003. The field of embedded systems 
is moving fast, and many new results became available. Also, there are areas for 
which the emphasis shifted. In some cases, a more detailed treatment of the topic 
became desirable. These changes were considered when the first German edition of 
the book was published in 2007. Corresponding updates were also incorporated into 
the second English edition published in late 2010/early 2011.

In the last decade, more technological changes occurred. There was a clear shift 
from single core systems toward multi-core systems. Cyber-physical systems (CPS) 
and the Internet of Things (IoT) gained more attention. Power consumption, thermal 
issues, safety, and security became more important. Overall, it became necessary to 
publish a third edition of this textbook. The changes just described had a major 
impact on several chapters of the third edition. This edition included and linked 
those aspects of embedded systems that provide foundations for the design of 
CPS and IoT systems. The preface and the introduction were rewritten to reflect 
these changes. Partial differential equations and transaction-level modeling (TLM) 
were added to the chapter on specifications and modeling. The use of this book 
in flipped classroom-based teaching led to the consideration of more details, in 
particular of specification techniques. For the third edition, the chapter on embedded 
system hardware includes multi-cores, a rewritten section on memories, and more 
information on the cyphy-interface (including pulse-width modulation [PWM]).
Descriptions of field programmable gate arrays (FPGAs) were updated and a brief
section on security issues in embedded systems was included. The chapter on system
software was extended by a section on Linux in embedded systems and more 
information on resource access protocols. In the context of system evaluation, new 
subsections on quality metrics, safety/security, energy models, and thermal issues 
were included. For this edition, the chapter on mapping to execution platforms 
was restructured: a standard classification of scheduling problems was introduced, 
and multi-core scheduling algorithms were added. The description of hardware-software
codesign was dropped. The chapter on optimizations was updated and
graphics were improved. Assignments (problems) and a clearer distinction between 
definitions, theorems, proofs, code, and examples were added. 

The current fourth edition is the first edition that is available under an
Open Access license. This change reflects the increasing importance of access to 
knowledge via the Internet. A key benefit is that this textbook becomes available to 
students free of charge. During the preparation of this fourth edition, all chapters
of the third edition have been carefully reviewed and updated if required. Errors 
found in the third edition have been corrected. The description of the bouncing ball 
experiment has been extended. The presentation of safety and security aspects has 
been restructured. More links to data analysis and artificial intelligence have been 
added. References have been updated. The distinction between jobs, tasks, threads, 
and processes has been clarified as much as possible. For this edition, it is typically
not feasible to cover the complete book in a single course for undergraduates, and
lecturers can select a subset that fits local needs and preferences.


Dortmund, Germany Peter Marwedel 
January 2021 
Frequently Used Mathematical Symbols


Because this book covers many areas, there is a high risk of using the same
symbol for different purposes. Therefore, symbols have been selected such that the
risk of confusion is low. This table is supposed to help maintain a consistent
notation.


a        Weight
a        Allocation
A        Availability (→ reliability)
A        Area
A        Ampere
b..      Communication bandwidth
B        Communication bandwidth
c_R      Characteristic vector for Petri net
c_p      Specific thermal capacitance
c_v      Volumetric heat capacity
C_i      Execution time
C        Capacitance
C        Set of Petri net conditions
C_th     Thermal capacity
°C       Degree Celsius
d_i      Absolute deadline
D_i      Relative deadline
e(t)     Input signal
e        Euler’s number (≈2.71828)
E        Energy
E        Graph edge
f        Frequency
f()      General function
f        Probability density
f_i      Finishing time of task/job i
F        Probability distribution
F        Flow relation of Petri net
g        Gravity
g        Gain of operational amplifier
g(t)     Signal
G        Graph
h        Height
h(t)     Signal
i        Index, task/job number
I        Current
j        Index, dependent task/job
J        Set of jobs
J        Joule
J_j      Job j
J        Jitter
k        Index, processor number
k        Boltzmann constant (≈1.3807 × 10^-23 J/K)
K        Kelvin
l        Processor number
l_i      Laxity of task/job i
L        Processor type
L        Length of conductor
L_i      Lateness of task τ_i
L_max    Maximum lateness
m        Number of processors
m        Mass
m        Meter
m        Milli-prefix
M        Marking of Petri net
MS_max   Makespan
n        Index
n        Number of tasks/jobs
N        Net
N        Natural numbers
O()      Landau’s notation
p_i      Priority of task τ_i
p_i      Place i of Petri net
P        Power
P(S)     Semaphore operation
Q        Resolution
Q        Charge
r_i      Release time of task/job i
R        Reliability
R_th     Thermal resistance
R        Real numbers
s        Time index
Restitution

Starting time of task/job j 

Second 

State 

Semaphore 

Schedule 

Size of memory j 

Time 

Transition i of Petri net 

Period 

Period of task τ_i

Utilization of task τ_i

Utilization 

Maximum utilization 

Velocity 

Graph nodes 

Voltage 

Volt 

Threshold voltage 

Semaphore operation 

Volume 

Signal 

Weight in Petri net 

Watt 

Input variable 

Signal 

Decision variable 

Signal 

Decision variable 

Signal 

Timer 

High impedance 

Integer numbers 

Arrival curve in real-time calculus 
Switching activity 

First component in Pinedo’s triplet 
Service function in real-time calculus 
Second component in Pinedo’s triplet 
Reciprocal of max. utilization 
Work load in real-time calculus 
Third component in Pinedo’s triplet 
Time interval 

Temperature 

Thermal conductivity 

Failure rate 


Number pi (≈3.1415926)

Set of processors 

Processor i 

Mass density 

Task τ_i

Set of tasks 

Threshold for RM-US scheduling 
Chapter 1
Introduction


This chapter presents terms used in the context of embedded systems together with 
their history as well as opportunities, challenges, and common characteristics of 
embedded and cyber-physical systems. Furthermore, educational aspects, design 
flows, and the structure of this book are introduced. 


1.1 History of Terms 


Until the late 1980s, information processing was associated with large mainframe 
computers and huge tape drives. Later, miniaturization allowed information process- 
ing with personal computers (PCs). Office applications were dominating, but some 
computers were also controlling the physical environment, typically in the form of 
some feedback loop. 

Later, Mark Weiser created the term “ubiquitous computing” [573]. This term 
reflects Weiser’s prediction of having computing (and information) anytime, anywhere.
Weiser also predicted that computers would be integrated into
products such that they would become invisible. Hence, he created the term “invisible
computer.” With a similar vision, the predicted penetration of our day-to-day life 
with computing devices led to the terms “pervasive computing” and “ambient 
intelligence.” These three terms focus on only slightly different aspects of future 
information technology. Ubiquitous computing focuses more on the long-term goal 
of providing information anytime, anywhere, whereas pervasive computing focuses 
more on practical aspects and the exploitation of already available technology. 
For ambient intelligence, there is some emphasis on communication technology 
in future homes and smart buildings. Due to the widespread use of small devices 
in combination with the mobile Internet, some of the visions about the future have 
already become a common practice. This widespread use is pervasive in the sense
that it has already had an impact on many areas of our life. Furthermore, artificial
intelligence is influencing our life as well.

Fig. 1.1 Relationship between embedded systems and CPS

Miniaturization also enabled the integration of information processing and the 
environment using computers. This type of information processing has been called 
an “embedded system”: 


Definition 1.1 (Marwedel [371]) “Embedded systems are information processing 
systems embedded into enclosing products.” 


Examples include embedded systems in cars, trains, planes, and telecommuni- 
cation or fabrication equipment. Embedded system products such as self-driving 
cars and trains are already available or have been announced. Consequently, we 
can expect miniaturization to have an impact on embedded systems comparable 
to the one it had on the availability of mobile devices. Embedded systems come 
with a large number of common characteristics, including real-time constraints, 
and dependability as well as efficiency requirements. For such systems, the link 
to physical systems is rather important. This link is emphasized in the following 
citation [331]: 

“Embedded software is software integrated with physical processes. The techni- 
cal problem is managing time and concurrency in computational systems.” 

This citation could be used as a definition of the term “embedded software” and
could be extended into a definition of “embedded systems” by just replacing
“software” with “system.”

However, the strong link to physical systems has recently been stressed even 
more by the introduction of the term “cyber-physical systems” (CPS for short). CPS 
can be defined as follows: 


Definition 1.2 (Lee [332]) “Cyber-Physical Systems (CPS) are integrations of 
computation and physical processes.” 


The new term emphasizes the link to physical processes and the corresponding 
physical environment. Emphasizing this link makes sense, since it is frequently 
ignored in a world of applications running on servers, PCs, and mobile phones. 
For CPS, models should include models of the physical environment as well. The 
term CPS comprises an embedded system (the information processing part) and a 
(dynamic) physical environment or CPS = ES + (dynamic) physical environment. 
This is also reflected in Fig. 1.1. 
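
The relation CPS = ES + (dynamic) physical environment can be made concrete with a small simulation. The following Python sketch is only an illustration under stated assumptions: the Room class (a first-order thermal model), the Thermostat class (a two-point controller), and all parameter values are hypothetical and are not taken from the cited sources. It indicates why a CPS model needs both parts: the behavior of the controller can only be judged together with the physics it acts on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Room:
    """Hypothetical physical environment: a first-order thermal model."""
    temperature: float          # current room temperature in deg C
    outside: float = 5.0        # ambient temperature in deg C (assumed)
    loss: float = 0.1           # fraction of the temperature gap lost per step
    heater_gain: float = 2.0    # deg C added per step while the heater is on

    def step(self, heater_on: bool) -> None:
        # Heat leaks toward the outside temperature; the heater adds energy.
        self.temperature += self.loss * (self.outside - self.temperature)
        if heater_on:
            self.temperature += self.heater_gain


class Thermostat:
    """Hypothetical embedded system: a two-point (on/off) controller."""

    def __init__(self, setpoint: float, hysteresis: float = 0.5) -> None:
        self.setpoint = setpoint
        self.hysteresis = hysteresis
        self.heater_on = False

    def control(self, measured: float) -> bool:
        # Switch on below the band, off above it, keep the state inside it.
        if measured < self.setpoint - self.hysteresis:
            self.heater_on = True
        elif measured > self.setpoint + self.hysteresis:
            self.heater_on = False
        return self.heater_on


def simulate(steps: int = 200) -> List[float]:
    """Run the closed loop and return the temperature after each step."""
    room = Room(temperature=10.0)
    controller = Thermostat(setpoint=20.0)
    trace = []
    for _ in range(steps):
        heater_on = controller.control(room.temperature)  # sense and compute
        room.step(heater_on)                              # actuate and evolve
        trace.append(room.temperature)
    return trace
```

Calling simulate() returns one temperature value per step. With the assumed parameters, the room warms up from 10 °C and then oscillates in a narrow band around the setpoint, a property that is visible only when the controller and the environment are modeled together.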

In their call for proposals, the National Science Foundation in the USA also
mentions communication [412]: “Emerging CPS will be coordinated, distributed, and
connected and must be robust and responsive.”

This is also done in the acatech report on CPS [6]: CPS ... “represent networked,
software-intensive embedded systems in a control loop, provide networked and
distributed services.”

Fig. 1.2 Importance of communication (© European Commission)

Interconnection and collaboration are also explicitly mentioned in a call for 
proposals by the European Commission [155]: “Cyber-Physical Systems (CPS) refer 
to next generation embedded ICT systems that are interconnected and collaborating 
including through the Internet of Things, and providing citizens and businesses with 
a wide range of innovative applications and services.” 

The importance of communication was visualized by the European Commission 
earlier, as shown in Fig. 1.2. 

From these citations, it is clear that the authors do not only associate the 
integration of the cyber- and the physical world with the term CPS. Rather, there 
is also a strong communication aspect. Actually, the term CPS is not always used 
consistently. Some authors emphasize the integration with the physical environment, 
others emphasize communication. 

Communication is more explicit in the term “Internet of Things” (IoT), which 
can be defined as follows: 


Definition 1.3 ([185]) The term Internet of Things “describes the pervasive presence
of a variety of devices — such as sensors, actuators, and mobile phones —
which, through unique addressing schemes, are able to interact and cooperate with 
each other to reach common goals.” 


This term links sensors (such that sensed information is available on the
Internet) and actuators (such that things can be controlled from the Internet). The
Internet of Things is expected to allow the communication between trillions of
devices in the world. This vision affects a large number of businesses.
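
The definition above combines three ingredients: things that sense, things that act, and a unique addressing scheme that lets them cooperate. A minimal sketch of this idea, assuming an in-memory registry in place of a real network, might look as follows. The class ThingRegistry, the function irrigate, and any address strings used with them are hypothetical names chosen for illustration only.

```python
from typing import Callable, Dict


class ThingRegistry:
    """Hypothetical stand-in for a network with a unique addressing scheme."""

    def __init__(self) -> None:
        self._sensors: Dict[str, Callable[[], float]] = {}
        self._actuators: Dict[str, Callable[[float], None]] = {}

    def _check_unique(self, address: str) -> None:
        # Every thing must be reachable under exactly one address.
        if address in self._sensors or address in self._actuators:
            raise ValueError(f"address already in use: {address}")

    def add_sensor(self, address: str, read: Callable[[], float]) -> None:
        self._check_unique(address)
        self._sensors[address] = read

    def add_actuator(self, address: str,
                     write: Callable[[float], None]) -> None:
        self._check_unique(address)
        self._actuators[address] = write

    def sense(self, address: str) -> float:
        """Make sensed information available to any caller."""
        return self._sensors[address]()

    def actuate(self, address: str, value: float) -> None:
        """Control a thing remotely through its address."""
        self._actuators[address](value)


def irrigate(things: ThingRegistry, sensor: str, valve: str,
             dry_below: float = 0.2) -> float:
    """Common goal of two things: open the valve fully if the soil is dry.

    The threshold dry_below is an assumed soil-moisture fraction.
    """
    opening = 1.0 if things.sense(sensor) < dry_below else 0.0
    things.actuate(valve, opening)
    return opening
```

A moisture sensor and a valve would be registered under two distinct addresses and then passed to irrigate by address only, so the cooperating function never needs to know where the devices are or how they are implemented.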

The exploitation of IoT-technology for production has been called “Industry 4.0” 
[68]. Industry 4.0 targets a more flexible production for which the entire life cycle 
from the design phase onward is supported by the IoT. 


To some extent, it is a matter of preferences whether the linking of physical 
objects to the cyber-world is called CPS or IoT. Taken together, CPS and IoT include 
most of the future applications of IT. 

The design of these future applications requires knowing fundamental 
design techniques for embedded systems. This book focuses on such fundamen- 
tal techniques and on foundations of embedded system design. Please remember 
that these are used in IoT and CPS designs though this is not repeatedly stated in 
each context. However, application-specific aspects of CPS and IoT are usually not 
covered. 


1.2 Opportunities 


There is a huge potential for applications of information processing in the context of 
CPS and IoT. The following list demonstrates this potential and the large variation 
of corresponding areas: 


• Transportation and mobility:


— Automotive electronics: Modern cars can be sold in technologically 
advanced countries only if they contain a significant amount of electronics 
[415]. These include airbag control systems, engine control systems, 
navigation systems, anti-lock braking systems (ABS), electronic stability programs
(ESP), air-conditioning, anti-theft protection, driver assistance systems, and 
many more. There is a trend toward autonomous driving. Embedded systems 
can improve comfort levels, avoid accidents, and reduce the impact on the 
environment. E-mobility would not be feasible without a significant amount 
of electronic components. 

— Avionics: A significant amount of the total value of airplanes is due to 
the information processing equipment, including flight control systems, anti-collision
systems, pilot information systems, autopilots, and others. Dependability
is of utmost importance.¹ Embedded systems can decrease emissions
(such as carbon dioxide) from airplanes. Autonomous flying is also becoming 
a reality, at least for certain application areas. 

— Railways: For railways, the situation is similar to the one discussed for cars 
and airplanes. Again, safety features contribute significantly to the total value 
of trains, and dependability is extremely important. Advanced signaling aims 
at safe operation of trains at high speed and short intervals between trains. The 
European Train Control System (ETCS) [444] is one step in this direction. 
Autonomous rail-based transportation is already used in restricted contexts 
like shuttle trains at airports. 


¹Problems with Boeing’s 737 MAX [419] underline this statement.
— Maritime engineering (ships, ocean technology, and other maritime systems):
Maritime systems, such as modern ships, use large amounts of
ICT equipment, e.g., for navigation, for safety, for optimizing the operation
in general, and for bookkeeping (see, e.g., http://www.smtcsingapore.com/ and
https://dupress.deloitte.com/dup-us-en/focus/internet-of-things/iot-in-shipping-industry.html).

— New concepts for mobility: The use of ICT technology and its components 
is enabling new concepts for mobility. Even untrained people can travel 
larger distances with e-bikes. The subtle interaction between human muscles 
and electric engines turns e-scooters into a prime example of cyber-physical 
systems. The collection of e-scooters at the end of each day, based on a list
of locations on the Internet, makes e-scooters a perfect example of the
Internet of Things. Also, CPS/IoT-technology is very important for collective 
taxis and other taxi-calling services. 


• Mechanical engineering (incl. manufacturing): Machinery and fabrication
equipment have been combined with embedded systems for decades. In order to 
optimize production technologies further, CPS/IoT-technology can be used. 
CPS/IoT-technology is the key toward more flexible manufacturing, being 
the target of “Industry 4.0” [68]. Factory automation is enabled by logistics. 
There are several ways in which CPS/IoT-systems can be applied to logistics 
[297]. For example, radio frequency identification (RFID) technology, if used 
in combination with computer networks, provides easy identification of each 
and every object, worldwide. Mobile communication allows unprecedented 
interaction. 

• Robotics: This is also a traditional area in which embedded/cyber-physical
systems have been used. Mechanical aspects are very important for robots. 
Hence, they may be linked to mechanical engineering. Robots, modeled after 
animals or human beings, have been designed. Figure 1.3 shows such a robot. 

• Power engineering and the smart grid: In the future, the production of
energy is supposed to be much more decentralized than in the past. Providing 
stability in such a scenario is difficult. ICT technology is required in order to 
achieve a sufficiently stable system. Information on the smart grid can be found,
for example, at https://www.smartgrid.gov/the_smart_grid and at
http://www.smartgrids.eu/.

• Civil engineering: CPS devices can be beneficial in many applications of civil
engineering. This includes structural health monitoring. Natural and artificial 
structures like mountains, volcanoes, bridges, and dams (see, e.g., Fig. 1.4) are 
potentially threatening lives. We can use embedded system technology to enable 
advance warnings in case of increased dangers like avalanches or collapsing 
dams.²


²The case of the dam in Brumadinho (see
https://en.wikipedia.org/wiki/Brumadinho_dam_disaster) is a counterexample of how modern sensors should be exploited.
Fig. 1.3 Humanoid Robot “Lola”, © Chair of Applied Mechanics, Technical University of Munich (TUM)


Fig. 1.4 Example of a dam to be monitored (Möhnesee dam), ©P. Marwedel 


• Disaster recovery: In the case of major disasters such as earthquakes or flooding,
it is essential to save lives and provide relief to survivors. Flexible communication 
infrastructures are essential for this. 

• Smart buildings: Smart buildings are one of the areas of civil engineering.
Information processing can be used to increase the comfort level in buildings, 
can reduce the energy consumption within buildings, and can improve safety and 
security. Subsystems which traditionally were unrelated must be connected for 
this purpose. There is a trend toward integrating air-conditioning, lighting, access 
control, accounting, safety features, and distribution of information into a single 
system. Tolerance levels of air-conditioning subsystems can be increased for 
empty rooms, and the lighting can be automatically reduced. Air-conditioning noise
can be reduced to a level required for the actual operating conditions. Intelligent 
usage of blinds can also optimize lighting and air-conditioning. Available rooms 
can be displayed at appropriate places, simplifying ad hoc meetings and cleaning. 
Lists of non-empty rooms can be displayed at the entrance of the building 
in emergency situations (provided the required power is still available). This 
way, energy can be saved on cooling, heating, and lighting. Also safety can 
be improved. Initially, such systems might mostly be present in high-tech office 
buildings, but the trend toward energy-efficient buildings also affects the design 
of private homes. One of the goals is to design so-called zero-energy-buildings 
(buildings which produce as much energy as they consume) [426]. Such a design 
would be one contribution toward a reduction of the global carbon-dioxide 
footprint and global warming. 

• Agricultural engineering: There are many agricultural applications. For example,
the “regulations for traceability³ of agricultural animals and their movements
require the use of technologies like IoT, making possible the real time
detection of animals, for example, during outbreaks of (a) contagious disease” 
[516]. 

• Health sector and medical engineering: The importance of healthcare products
is increasing, in particular in aging societies. Opportunities start with new sen- 
sors, detecting diseases faster and more reliably. New data analysis techniques 
(e.g., based on machine learning) can be used to detect increased risks and 
improve chances for healing. Therapies can be supported with personalized 
medication based on artificial intelligence methods. New devices can be designed 
to help patients, e.g., handicapped patients. Also, surgery can be supported 
with new devices. Embedded system technologies also allow for a significantly 
improved result monitoring, giving doctors much better means for checking 
whether or not a certain treatment has a positive impact. This monitoring also 
applies to remotely located patients. Available information can be stored in 
patient information systems. Lists of projects in this area can be found, for
example, at http://cps-vo.org/group/medical-cps and at
http://www.nano-tera.ch/program/health.html.

• Scientific experiments: Many contemporary experiments in sciences, in particular
in physics, require the observation of experiment outcomes with IT devices.
The combination of physical experiments and IT devices can be seen as a special 
case of CPS. 

• Public safety: The interest in various kinds of safety is also increasing. Embedded
and cyber-physical systems and the Internet of Things can be used to improve
safety in many ways. This includes public health in times of pandemics and the 
identification/authentication of people, for example, with fingerprint sensors or 
face recognition systems. 


³ The importance of traceability in general, beyond animals, became particularly obvious during
the COVID-19 crisis.

• Military applications: Information processing has been used in military equipment
for many years. Some of the first computers analyzed military radar signals.

• Telecommunication: Mobile phones have been one of the fastest-growing markets
in recent years. For mobile phones, radio frequency (RF) design, digital
signal processing, and low-power design are key aspects. Telecommunication is
a salient feature of IoT. Other forms of telecommunication are also important.

• Consumer electronics: Video and audio equipment is a major sector of the
electronics industry. The information processing integrated into such equipment 
is steadily growing. New services and better quality are implemented using 
advanced digital signal processing techniques. Many TV sets (in particular high- 
definition TV sets), multimedia phones, and game consoles comprise powerful 
high-performance processors and memory systems. They represent special cases 
of embedded systems. Compared to other types of embedded systems, safety and 
real-time behavior are less important. Nevertheless, certain real-time constraints 
must be met in order to achieve a certain frame rate or to meet time constraints 
for communication protocols. Also, there is a limited availability of resources 
like electrical energy and communication bandwidth. In this sense, limited 
availability of resources is a feature which consumer electronics shares with the 
other application areas mentioned so far. 


The large set of examples demonstrates the huge variety of applications of 
embedded systems in CPS and IoT systems. Even more applications are listed in a 
report on opportunities and challenges of the IoT [516]. In a way, many of the future 
applications of ICT technology can be linked to such systems. From the above list, 
we conclude that almost all engineering disciplines will be affected. 

The long list of application areas gives embedded systems a corresponding
economic importance. The acatech report [6]
mentions that, at the time of writing the report, 98% of all microprocessors were
used in these systems. In a way, embedded system design is an enabler for many
products and has an impact on the combined market volume of all the areas
mentioned. However, it is difficult to quantify the size of the CPS/IoT market since 
the total market volume of all these areas is significantly larger than the market 
volume of their ICT components. Referring to the value of semiconductors in the 
CPS/IoT market would also be misleading, since that value is only a fraction of the 
overall value. 

The economic importance of CPS and the IoT is reflected in calls for proposals 
by funding organizations, like the NSF [116] and the European Commission [156]. 


1.3 Challenges 


Unfortunately, the design of embedded systems and their integration in CPS and 
IoT systems comes with a large number of difficult design issues. Commonly found 
issues include the following:

• Cyber-physical and IoT systems must be dependable.


Definition 1.4 A system is dependable if it provides its intended service with a 
high probability and does not cause any harm. 


A key reason for the need to be dependable is that these systems are directly
connected to the physical environment and have an immediate impact on that
environment. The issue needs to be considered during the entire design process.
Dependability encompasses the following aspects of a system:


1. Security: 


Definition 1.5 ([75, 255]) Information security can be defined as the “preservation
of confidentiality, integrity and availability of information.”


This preservation can be compromised by thefts or damages, resulting from 
attacks from the outside. Connecting components in IoT systems enables such 
attacks, with cyber-crime and cyber-warfare as special, potentially harmful 
cases. Connecting more components enables more attacks and more damages. 
This is a serious issue in the design and proliferation of IoT systems. 

The only really secure solution is to disconnect components, which 
contradicts the idea of using connected systems. Related research is therefore 
expected to be one of the fastest-growing areas in ICT-related research. 

According to Ravi et al. [300], the following typical elements of security 
requirements exist: 


— A user identification process validates identities before allowing users to
access the system. 

— Secure network access provides a network connection or service access 
only if the device is authorized. 

— Secure communications include a number of communication features. 

— Secure storage requires confidentiality and integrity of data. 

— Content security enforces usage restrictions. 


2. Confidentiality is one of the aspects of security.


Definition 1.6 ([255]) Confidentiality is “property that information is not 
made available or disclosed to unauthorized individuals, entities, or pro- 
cesses.” 


Confidentiality is typically implemented using techniques which are found in 
secure systems, e.g., encryption.

3. Safety:


Definition 1.7 ([250]) Safety can be defined as the absence of “unacceptable
risk of physical injury or of damage to the health of people, either directly or 
indirectly as a result of damage to property or to the environment.” 

“Functional safety is the part of the overall safety that depends on a system
or equipment operating correctly in response to its inputs.”

“In the context of computer systems, this term is used to distinguish from
threats due to external attacks, e.g., due to malicious software. In contrast 
to such threats, safety refers to risks caused by failures occurring without 
any external action, e.g., hardware failures, power failures, incorrectly written 
software, or operator errors” (translated from German [576]). 

4. Reliability: This term refers to malfunctions of systems resulting from 
components not working according to their specification at design time. 
Lack of reliability can be caused by breaking components. Reliability is the 
probability that a system will not fail within a certain amount of time.⁴ For
an evaluation of reliability, we are not considering malicious attacks from 
the outside but only effects occurring within the system itself during normal, 
intended operation. 

5. Repairability: Repairability (also spelled reparability) is the probability that 
a failing system can be repaired within a certain time. 

6. Availability: Availability is the probability that the system is available. 
Reliability and repairability must be high and security hazards absent in order 
to achieve a high availability. 
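The relationship between reliability, repairability, and availability can be illustrated numerically. The following is a minimal sketch that assumes exponentially distributed failure and repair times with constant rates; this assumption, the function names, and the example rates are hypothetical and are not taken from the text above. Under this assumption, reliability is R(t) = exp(-λt), and the long-run availability is MTTF / (MTTF + MTTR).

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """Probability that no failure occurs within time t (constant failure rate assumed)."""
    return math.exp(-failure_rate * t)

def repairability(repair_rate: float, t: float) -> float:
    """Probability that a repair completes within time t (constant repair rate assumed)."""
    return 1.0 - math.exp(-repair_rate * t)

def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    mttf = 1.0 / failure_rate   # mean time to failure
    mttr = 1.0 / repair_rate    # mean time to repair
    return mttf / (mttf + mttr)

# Hypothetical example values, chosen only for illustration.
FAILURE_RATE = 1e-4   # failures per hour
REPAIR_RATE = 0.5     # repairs per hour

print(f"R(1000 h)    = {reliability(FAILURE_RATE, 1000.0):.4f}")
print(f"M(4 h)       = {repairability(REPAIR_RATE, 4.0):.4f}")
print(f"availability = {steady_state_availability(FAILURE_RATE, REPAIR_RATE):.6f}")
```

In this simplified model, availability rises both when failures become rarer and when repairs become faster, which matches the statement that reliability and repairability must both be high. Security hazards are not modeled here.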


Designers may be tempted to focus just on the functionality of systems initially, 
assuming that dependability can be added once the design is working. Typically, 
this approach does not work, since certain design decisions will not allow
achieving the required dependability afterwards. For example, if the
physical partitioning is done in the wrong way, redundancy may be impossible.
Therefore, “making the system dependable must not be an after-thought”; it
must be considered from the very beginning [303]. Good compromises achieving
an acceptable level of safety, security, confidentiality, and reliability have to be 
found [296]. 

Even perfectly designed systems can fail if the assumptions about the 
workload and possible errors turn out to be wrong [303]. For example, a system 
might fail if it is operated outside the initially assumed temperature range. 

• If we look closely at the interface between the physical and the cyber-world, we
observe a mismatch between physical and cyber models. The following list 
shows examples: 


— Many cyber-physical systems must meet real-time constraints. Not complet- 
ing computations within a given time frame can result in a serious loss of the 
quality provided by the system (e.g., if the audio or video quality is affected) 
or may cause harm to the user (e.g., if cars, trains, or airplanes do not operate 
in the predicted way). Some time constraints are called hard time constraints: 


Definition 1.8 (Kopetz [303]) “A time-constraint is called hard if not meet- 
ing that constraint could result in a catastrophe.” 


All other time constraints are called soft time constraints. 


⁴ A formal definition of this term is provided in Definition 5.36 on p. 281 of this book.

Many of today’s information processing systems are using techniques for
speeding up information processing on average. For example, caches
improve the average performance of a system. In other cases, reliable com- 
munication is achieved by repeating certain transmissions. On average, such 
repetitions result in a (hopefully) small loss of performance, even though for a 
certain message the communication delay can be several orders of magnitude 
larger than the normal delay. In the context of real-time systems, arguments 
about the average performance or delay cannot be accepted. “A guaranteed 
system response has to be explained without statistical arguments” [303]. 
Many modeling techniques in computer science do not model real time. 
Frequently, time is modeled without any physical units attached to it, which 
means that no distinction is made between picoseconds and centuries. The 
resulting problems are very clearly formulated in a statement made by Edward 
Lee: “The lack of timing in the core abstraction (of computer science) is a flaw, 
from the perspective of embedded software” [330]. 
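The gap between average and worst-case delay for repeated transmissions can be made concrete with a small calculation. The sketch below assumes a simple send-and-retry scheme in which each attempt is lost independently with a fixed probability; the scheme, the function name, and all numbers are hypothetical and serve only as an illustration.

```python
def delay_statistics(base_delay_ms, retry_timeout_ms, loss_probability, max_retries):
    """Return (average, worst_case) delay in ms over all delivered messages.

    A message delivered after k lost attempts takes k * retry_timeout_ms + base_delay_ms.
    Messages lost on every attempt are excluded from the average.
    """
    weighted_sum = 0.0
    delivered_probability = 0.0
    for k in range(max_retries + 1):
        p_k = (loss_probability ** k) * (1.0 - loss_probability)
        weighted_sum += p_k * (k * retry_timeout_ms + base_delay_ms)
        delivered_probability += p_k
    average = weighted_sum / delivered_probability
    worst_case = max_retries * retry_timeout_ms + base_delay_ms
    return average, worst_case

# Hypothetical link: 1 ms normal delay, 100 ms retry timeout, 1% loss, up to 5 retries.
average_ms, worst_ms = delay_statistics(1.0, 100.0, 0.01, 5)
print(f"average delay: {average_ms:.2f} ms, worst-case delay: {worst_ms:.0f} ms")
```

With these assumed numbers, the average stays close to the normal delay while the worst case is larger by more than two orders of magnitude. A real-time guarantee would have to be based on the worst case, not on the average.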

— Many embedded systems are hybrid systems in the sense that they include 
analog and digital parts. Analog parts use continuous signal values in con- 
tinuous time, whereas digital parts use discrete signal values in discrete 
time. Many physical quantities are represented by a pair, consisting of a real 
number and a unit. The set of real numbers is uncountable. In the cyber- 
world, the set of representable values for each number is finite. Hence, almost 
all physical quantities can only be approximated in digital computers. 
During simulations of physical systems on digital computers, we are typically 
assuming that this approximation gives us meaningful results. In a paper, Taha 
considered consequences of the non-availability of real numbers in the cyber- 
world [522]. 
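The effect of a finite set of representable values can be sketched with an idealized analog-to-digital conversion. The converter model below (uniform steps, mid-point reconstruction), its function name, and the chosen voltage range and resolution are assumptions made for illustration only.

```python
def quantize(value, v_min, v_max, bits):
    """Map a real value to one of 2**bits codes and back to its representative value."""
    levels = 2 ** bits
    step = (v_max - v_min) / levels
    code = int((value - v_min) / step)
    code = max(0, min(levels - 1, code))      # clamp to the representable range
    return code, v_min + (code + 0.5) * step  # mid-point of the selected interval

# Hypothetical 10-bit converter with a 0 V to 3.3 V input range.
code, approximation = quantize(1.0, 0.0, 3.3, 10)
print(code, approximation, abs(approximation - 1.0))

# Floating-point numbers are also only approximations of real numbers.
print(0.1 + 0.2 == 0.3)
```

In this model, the approximation error is bounded by half a quantization step, so adding bits reduces the error but never removes it.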

— Physical systems can exhibit the so-called Zeno effect. The Zeno effect can 
be introduced with the help of the bouncing ball example. Suppose that we 
are dropping a bouncing ball onto the floor from a particular height. After 
releasing the ball, it will start to fall, being accelerated by the gravitation of 
the earth. When it hits the floor, it will bounce, i.e., it will start to move in 
the opposite direction. However, we assume that bouncing will have some 
damping effect and that the initial speed of the ball after the bounce will be 
reduced by a factor of s < 1, compared to the speed right before the bounce. 
The case s < 1 is also called inelastic collision. s is called the restitution. Due 
to this, the ball will not reach its initial height. Furthermore, the time to reach 
the floor a second time will be shorter than for the initial case. This process 
will be repeated, with smaller and smaller intervals between the bounces. 
However, according to the ideal model of inelastic collisions, this process will 
go on and on. Figure 1.5 visualizes the height as a function of time (a so-called 
time/distance diagram) of the inelastic collision. 

Now, let Δ be an arbitrary time interval, anywhere in the time domain.
Would there be an upper bound on the number of bounces in this time interval? 
No, there would not be an upper bound, since bouncing is repeated in shorter 
and shorter intervals.

Fig. 1.5 Time/distance diagram of the inelastic collision (© Openmodelica)

Fig. 1.6 Control loop


This is a special case of the Zeno effect. A system is said to exhibit a Zeno 
effect, when it is possible to have an unlimited number of events in an interval 
of finite length [403]. Mathematically speaking, this is feasible since infinite 
series may be converging to a finite value. In this case, the infinite series of 
times at which bouncing occurs is converging to a finite instant in time. See
the discussion starting on p. 46 for more details. On digital computers, the 
unlimited number of events can only be approximated. 
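The convergence of the bounce times can be checked numerically. The sketch below uses the ideal model described above: a ball dropped from rest, no air resistance, and the speed after each bounce reduced by the restitution s < 1. The function names, the drop height, and the value of s are hypothetical. If the first impact occurs at time t₀ = sqrt(2h/g), the following flights last 2·s·t₀, 2·s²·t₀, and so on, so the impact times converge to t₀·(1 + s)/(1 − s).

```python
import math

def bounce_times(height_m, restitution, count, g=9.81):
    """Times of the first `count` floor impacts in the ideal bouncing-ball model."""
    t_fall = math.sqrt(2.0 * height_m / g)   # time of the first impact
    times = [t_fall]
    flight = 2.0 * restitution * t_fall      # duration of the first up-and-down flight
    for _ in range(count - 1):
        times.append(times[-1] + flight)
        flight *= restitution                # each flight is shorter by the factor s
    return times

def zeno_time(height_m, restitution, g=9.81):
    """Limit of the impact times: the sum of the geometric series of flight times."""
    t_fall = math.sqrt(2.0 * height_m / g)
    return t_fall * (1.0 + restitution) / (1.0 - restitution)

# Hypothetical example: drop height 1 m, restitution 0.8.
times = bounce_times(1.0, 0.8, 50)
limit = zeno_time(1.0, 0.8)
print(f"50th impact at {times[-1]:.6f} s, accumulation point at {limit:.6f} s")
```

A simulation can only compute finitely many of these impacts, which is why tools have to stop or switch models when the intervals between events become too small.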

Many CPS comprise control loops, like the one shown in Fig. 1.6. 

Control theory was initially based on analog continuous feedback systems. 
For digital, discrete time feedback, periodic sampling of signals has been 
the default assumption for decades and it worked reasonably well. However, 
periodic sampling is possibly not the best approach. We could save resources
if we extended sampling intervals during times of relatively constant
signals. This is the idea of adaptive sampling. Adaptive sampling is an area
of active research [209].

— Traditional sequential programming languages are not the best way to describe
concurrent, timed systems.

Traditionally, the process of verifying whether or not some product is a correct 
implementation of the specification is generating a Boolean result: either the 
product is correct or not. However, two physically existing products will never 
be exactly identical. Hence, we can only check with some level of imprecision 
whether a product is a correct implementation of the design. This introduces 
fuzziness and Boolean verification is replaced by fuzzy verification [184, 446]. 
Edward Lee pointed out that the combination of a deterministic physical 
model and a deterministic cyber model will possibly be a non-deterministic 
model [333]. Non-deterministic sampling can be one reason for this. 


Overall, we observe a mismatch between the physical and the cyber-world. 
Effectively, we are still looking for appropriate models for CPS, but cannot expect
to completely eliminate the mismatch.


• Embedded systems must use resources efficiently. This requires that we must be
aware of the resources needed. The following metrics can be considered in order
to evaluate resource efficiency:


1. Energy: Electronic information and communication technology (ICT) uses
electrical energy for information processing and communication. The amount
of electrical energy used is frequently called “consumed energy.” Strictly 
speaking, this term is not correct, since the total amount of energy is invariant. 
Actually, we are converting electrical energy into some other form of energy, 
typically thermal energy. For embedded systems, the availability of electrical 
power and energy (as the integral of power over time) is a deciding factor. This 
was already observed in a Dutch road mapping effort: “Power is considered 
as the most important constraint in embedded systems” [150]. 
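Since energy is the integral of power over time, it can be estimated from sampled power measurements. The following minimal sketch uses the trapezoidal rule on equally spaced samples; the function name, the sampling interval, and the 2 W load are hypothetical values for illustration. The battery figures are those of the mobile phone example given in a footnote of this section.

```python
def energy_wh(power_samples_w, dt_s):
    """Approximate the integral of power over time (trapezoidal rule), in watt-hours."""
    joules = sum((a + b) / 2.0 * dt_s
                 for a, b in zip(power_samples_w, power_samples_w[1:]))
    return joules / 3600.0

# Hypothetical trace: a device drawing 2 W, sampled once per minute for one hour.
trace_w = [2.0] * 61
used_wh = energy_wh(trace_w, 60.0)

# Battery from the mobile phone example: 3600 mAh at an average voltage of 4 V.
battery_wh = 3.6 * 4.0

print(f"energy used in one hour: {used_wh:.2f} Wh")
print(f"battery capacity: {battery_wh:.1f} Wh, about {battery_wh / 2.0:.1f} h at 2 W")
```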

Why should we care about the amount of electrical energy converted, i.e., 
why should there be energy awareness? There are many reasons for this. Most 
reasons are applicable to most types of systems, but there are exceptions, as
shown in Table 1.1.


Table 1.1 Relevance of reasons to be energy-aware

                                            Relevant during use?
System type                                 Plugged   Charged   Unplugged
Example                                     Factory   Laptop    Sensor network
Global warming                              Yes       Hardly    No
Cost of energy                              Yes       Hardly    Typically not
Increasing performance                      Yes       Yes       Yes
Unplugged uptime                            No        Yes       Yes
Problems with cooling, avoiding hot spots   Yes       Yes       Yes
Avoiding high currents, metal migration     Yes       Yes       Yes
Energy a very scarce resource               No        Hardly    Yes

Global warming is of course a very important reason for trying to be
energy-aware. However, typically very limited amounts of energy are available
to unplugged systems, and, hence, their contribution to global warming
is small.⁵

The cost of energy is relevant whenever the amount of energy needed is 
expensive. For plugged systems, this could happen due to large amounts of 
consumed energy. For unplugged systems, these amounts are typically small, 
but there could be cases for which it is very expensive to provide even a small 
amount. 

Increased computing performance usually requires additional energy and, 
hence, has an impact on the resulting energy consumption. 

Thermal effects are becoming more important and have to be considered 
as well. The reliability of circuits decreases with increasing temperatures. 
Hence, increased energy consumption typically decreases reliability.
It may be necessary to power down parts of the system completely to cope
with thermal constraints. This effect has been called the dark silicon effect
(certain areas of silicon chips have to remain unpowered or “dark”) [153].

In some cases (like remote sensor nodes), energy is a really scarce resource. 

It is interesting to look at those cases where certain reasons to save energy 
can be considered irrelevant: For systems connected to the power grid, energy 
is not a really scarce resource. Unplugged systems, due to the limited capacity 
of batteries, consume very small amounts of energy, and their impact on global 
warming is small. Systems which are only temporarily connected to the power 
grid are somewhere between their plugged and unplugged counterparts. 

The importance of power and energy efficiency was initially recognized 
for embedded systems. The focus on these objectives was later taken up for 
general-purpose computing as well and led to initiatives such as the green 
computing initiative [11]. 

In general, not only the energy consumption during the use of some device 
is important. Rather, the fabrication of the device should be considered as 
well, due to the energy consumption during fabrication. Hence, we should 
consider the entire life cycle of a product in the form of a so-called life-cycle 
assessment (LCA) [374]. It is feasible to reduce the impact on the environment 
by purchasing new devices less frequently. 

2. Run-time: Embedded systems should exploit the available hardware architec- 
ture as much as possible. Inefficient use of execution time (e.g., wasted pro- 
cessor cycles) should be avoided. This implies an optimization of execution 
times across all levels, from algorithms down to hardware implementations.


⁵ This can be demonstrated by means of an example. Consider a mobile phone battery having a
capacity of 3600 mAh. We assume an average voltage of 4 V. This results in an energy of 14.4 Wh.
A fully charged battery stores as much energy as is consumed by a typical residential gateway
(turned on 24/7) in about 1 to 2.5 h or a TV set in a fraction of an hour.

3. Code size: For some embedded systems, code typically has to be stored on
the system itself. There may be tight constraints on the storage capacity of 
the system. This is especially true for systems on a chip (SoCs), systems for 
which all the information processing circuits are included on a single chip. If 
the instruction memory is to be integrated onto this chip, it should be used 
very efficiently. For example, there may be medical devices implanted into 
the human body. Due to size and communication constraints of such devices, 
code has to be very compact. 

However, the importance of this design goal might change, when dynam- 
ically loading code becomes acceptable or when larger memory densities 
(measured in bits per volume unit) become available. Flash-based memories 
and new memory technologies will potentially have a large impact. 

4. Weight: All portable systems must be lightweight. A low weight is frequently 
an important argument for buying a particular system. 

5. Cost: For high-volume embedded systems in mass markets, especially in 
consumer electronics, competitiveness on the market is an extremely crucial 
issue, and efficient use of hardware components and the software development 
budget are required. A minimum amount of resources should be used for 
implementing the required functionality. We should be able to meet require- 
ments using the least amount of hardware resources and energy. In order to 
reduce the energy consumption, clock frequencies and supply voltages should 
be as low as possible. Also, only the necessary hardware components should 
be present, and over-provisioning should be avoided. Components which do 
not improve the worst case execution time (such as many caches or memory 
management units) can sometimes be omitted. 
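The advice to keep clock frequencies and supply voltages low can be illustrated with the commonly used first-order model of dynamic power in CMOS circuits, P ≈ α·C·V²·f. This model is brought in here as an assumption for illustration and is not stated in the text above; it ignores static (leakage) power, and the function names and operating points are hypothetical.

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS dynamic power model: P = alpha * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz

def task_energy(activity, capacitance_f, voltage_v, frequency_hz, cycles):
    """Energy for a task needing a fixed number of clock cycles (run time = cycles / f)."""
    run_time_s = cycles / frequency_hz
    return dynamic_power(activity, capacitance_f, voltage_v, frequency_hz) * run_time_s

# Hypothetical operating points: full speed versus half voltage and half frequency.
p_full = dynamic_power(0.2, 1e-9, 1.2, 1.0e9)
p_half = dynamic_power(0.2, 1e-9, 0.6, 0.5e9)
e_full = task_energy(0.2, 1e-9, 1.2, 1.0e9, 1_000_000)
e_half = task_energy(0.2, 1e-9, 0.6, 0.5e9, 1_000_000)

print(f"power ratio:  {p_half / p_full:.3f}")
print(f"energy ratio: {e_half / e_full:.3f}")
```

In this model, lowering the frequency alone reduces power but not the energy per task, since the task then runs longer. The energy saving comes from the lower supply voltage that a lower frequency permits.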


Due to resource awareness targets, software designs cannot be done indepen- 
dently of the underlying hardware. Therefore, software and hardware must be 
taken into account during the design steps. This, however, is difficult, since 
such integrated approaches are typically not taught at educational institutes. The 
cooperation between electrical engineering and computer science has not yet 
reached the required level. 

A mapping of specifications to custom hardware would provide the best 
energy efficiency. However, hardware implementations are very expensive and 
require long design times. Therefore, hardware designs do not provide the 
flexibility to change designs as needed. We need to find a good compromise 
between efficiency and flexibility. 

• CPS and IoT systems are frequently collecting huge amounts of data. These large
amounts of data have to be stored and they have to be analyzed. Hence, there is a
strong link between the problems of big data (or machine learning) and CPS/IoT.
This is exactly the topic of our collaborative research center SFB 876.⁶ SFB 876
focuses on machine learning under resource constraints.


⁶ See http://www.sfb876.tu-dortmund.de.

• Impact beyond technical issues: Due to the major impact on society, legal,
economic, social, human, and environmental impacts must be considered as 
well: 


— The integration of many components, possibly by different providers, raises 
serious issues concerning liability. These issues are being discussed, for 
example, for self-driving cars. Also, ownership issues must be solved. It is 
unacceptable to have one of the involved companies own all rights. 

— Social issues include the impact of new IT devices on society. This has led 
to the introduction of the term Cyber-Physical-Social Systems (CPSS) [140]. 
Currently, this impact is frequently only detected long after the technology 
became available. 

— Human issues comprise user-friendly man-machine interfaces. 

— Contributions to global warming and the production of waste should be at an 
acceptable level. The same applies to the consumption of resources. 


• Real systems are concurrent. Managing concurrency is therefore another major
challenge. 

• Cyber-physical and IoT systems typically consist of heterogeneous hardware
and software components from various providers and have to operate in
a changing environment. The resulting heterogeneity poses challenges for the 
correct cooperation of components. It is not sufficient to consider only software 
or only hardware design. Design complexity requires adopting a hierarchical 
approach. Furthermore, real embedded systems consist of many components and 
we are interested in compositional design. This means, we would like to study 
the impact of combining components [213]. For example, we would like to know 
whether we could add a GPS system to the sources of information in a car without 
overloading the communication bus. 
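The GPS question above can be approached with a first-order utilization check. The sketch below is only an illustration of compositional reasoning: the message set, frame sizes, periods, bit rate, and the 50% load limit are all hypothetical, and the check ignores arbitration, retransmissions, and worst-case latencies that a real bus analysis would have to cover.

```python
from dataclasses import dataclass

@dataclass
class Message:
    name: str
    frame_bits: int    # bits on the bus per transmission, including protocol overhead
    period_s: float    # time between two transmissions

def bus_utilization(messages, bitrate_bps):
    """Fraction of the bus bandwidth used by a set of periodic messages."""
    return sum(m.frame_bits / m.period_s for m in messages) / bitrate_bps

# Hypothetical existing traffic on a 125 kbit/s bus.
existing = [
    Message("engine", 128, 0.010),
    Message("brakes", 128, 0.005),
    Message("body", 128, 0.100),
]
gps = Message("gps", 1024, 0.100)   # hypothetical additional component

LOAD_LIMIT = 0.5                    # assumed design margin, not a standard value
before = bus_utilization(existing, 125_000)
after = bus_utilization(existing + [gps], 125_000)
print(f"utilization before: {before:.3f}, after adding GPS: {after:.3f}")
print("GPS can be added" if after <= LOAD_LIMIT else "bus would be overloaded")
```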

• CPS design involves knowledge from many areas. It is difficult to find staff
members with a sufficient amount of knowledge in all relevant areas. Even 
organizing the knowledge transfer between relevant areas is already challenging. 
Designing a curriculum for a program in CPS design is even more challenging, 
due to the tight ceilings for the total workload for students [379]. Overall, tearing 
down walls between disciplines and departments or at least lowering them 
would be required. 


A list of challenges is also included in a report on IoT by Sundmaeker et al. [516]. 


1.4 Common Characteristics 


In addition to the challenges listed above, there are more common characteristics of 
embedded, cyber-physical and IoT systems, independently of the application area.

• CPS and IoT systems use sensors and actuators to connect the embedded system
to the physical environment. For IoT, these components are connected to the 
Internet. 


Definition 1.9 Actuators are devices converting numbers into physical effects.

• Typically, embedded systems are reactive systems, which are defined as follows:


Definition 1.10 (Bergé [567]) “A reactive system is one that is in continual 
interaction with its environment and executes at a pace determined by that 
environment.” 


Reactive systems are modeled as being in a certain state, waiting for an input. 
For each input, they perform some computation and generate an output and a 
new state. Hence, automata are good models of such systems. Mathematical 
functions, describing the problems solved by most algorithms, would be an 
inappropriate model. 
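The automaton view of a reactive system can be sketched in a few lines. The example below models a hypothetical door controller as a finite state machine that waits for an input, produces an output, and moves to a new state; the states, inputs, and outputs are invented for illustration.

```python
# Transition table: (current state, input) -> (next state, output).
TRANSITIONS = {
    ("closed", "open_request"): ("opening", "motor_up"),
    ("opening", "fully_open"): ("open", "motor_off"),
    ("open", "close_request"): ("closing", "motor_down"),
    ("closing", "fully_closed"): ("closed", "motor_off"),
    ("closing", "obstacle"): ("opening", "motor_up"),
}

def step(state, event):
    """Return (next_state, output); inputs without a transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, "none"))

def run(initial_state, events):
    """Feed a sequence of inputs to the automaton and collect the outputs."""
    state, outputs = initial_state, []
    for event in events:
        state, output = step(state, event)
        outputs.append(output)
    return state, outputs

final_state, outputs = run(
    "closed",
    ["open_request", "fully_open", "close_request", "obstacle", "fully_open"],
)
print(final_state, outputs)
```

The pace of execution is set by the arrival of inputs from the environment, not by the program, and the result depends on the stored state as well as on the current input. This is why a plain mathematical function of the input is not an adequate model.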

• Embedded systems are under-represented in teaching and in public discussions.
Real embedded systems are complex. Hence, comprehensive equipment is
required for realistically teaching embedded system design. However, teaching 
CPS design can be appealing, due to the visible impact on the physical behavior. 

• These systems are frequently dedicated toward a certain application. For
example, processors running control software in a car or a train will typically 
always run that software, and there will be no attempt to run a game or 
spreadsheet program on the same processor. There are mainly two reasons for 
this: 


1. Running additional programs would make those systems less dependable. 
2. Running additional programs is only feasible if resources such as memory are 
unused. No unused resources should be present in an efficient system. 


However, the situation is slowly changing. For example, the AUTOSAR 
initiative [28] demonstrates more dynamism in the automotive industry. 

• Most embedded systems do not use keyboards, mice, and large computer
monitors for their user interface. Instead, there is a dedicated user interface 
consisting of push buttons, steering wheels, pedals, etc. Because of this, the user 
hardly recognizes that information processing is involved. This is consistent with 
the introduction of the term disappearing computer. 


Table 1.2 highlights some distinguishing features between the designs of PC-like 
or data center server-like systems and embedded systems. 

Table 1.2 Distinction between PC-like and embedded system designs

                              Embedded                        PC-/server-like
Architectures                 Frequently heterogeneous,       Mostly homogeneous,
                              very compact                    not compact (x86, etc.)
x86 compatibility             Less relevant                   Very relevant
Architecture fixed?           Rarely                          Yes
Models of computation (MoCs)  C+multiple models (data flow,   Mostly von Neumann (C,
                              discrete events, ...)           C++, Java)
Optimization objectives       Multiple (energy, size, ...)    Average performance dominates
Safety-critical?              Possibly                        Usually not
Real-time relevant            Frequently                      Hardly
Apps. known at design time    Yes, for real-time systems      Only some (e.g., WORD)


Compatibility with traditional instruction sets employed for PCs is less important
for embedded systems, since it is typically possible to compile software
applications for architectures at hand. Sequential programming languages do not
match well with the need to describe concurrent real-time systems, and other
ways of modeling applications may be preferred. Several objectives must be
considered during the design of embedded/cyber-physical systems. In addition to the
average performance, the worst case execution time, energy consumption, weight,
reliability, operating temperatures, etc. may have to be optimized. Meeting real-time
constraints is very important for CPS but hardly so for PC-like systems. Time
constraints can be verified at design time only if all the applications are known
at this time. Also, it must be known which applications should run concurrently.
For example, designers must ensure that a GPS application, a phone call, and data
transfers can be executed at the same time without losing voice samples. For PC-like
systems, knowledge about concurrently running software is almost never available
and best effort approaches are typically used.

Why does it make sense to consider all types of embedded systems in one book? 
It makes sense because information processing in embedded systems has many 
common characteristics, despite being physically so different. 

Actually, not every embedded system will have all the above characteristics. We 
can define the term “embedded system” also in the following way: 


Definition 1.11 Information processing systems meeting most of the characteris- 
tics listed above are called embedded systems. 


This definition includes some fuzziness. However, it seems to be neither neces- 
sary nor possible to remove this fuzziness. 


1.5 Curriculum Integration of Embedded Systems, CPS, and IoT


Unfortunately, embedded systems are hardly covered in the 2013 edition of the 
Computer Science Curriculum, as published by ACM and the IEEE Computer 
Society [10]. However, the growing number of applications results in the need for 
more education in this area. This education should help overcome the limitations of 
currently available design technologies. Surveys of requirements and approaches 
to CPS education have been published by the National Academies of Sciences,
Engineering, and Medicine [409] and by Marwedel et al. [379]. There is still a
need for better specification languages, models, tools generating implementations 
from specifications, timing verifiers, system software, real-time operating systems, 
low-power design techniques, and design techniques for dependable systems. This 
book should help in teaching the essential issues and should be a stepping stone for 
starting more research in the area. Additional information related to the book can
be obtained from the following web page:
http://ls12-www.cs.tu-dortmund.de/~marwedel/es-book

This page includes links to slides, videos, simulation tools, error corrections, and
other related materials. Videos are directly accessible from:
https://www.youtube.com/user/cyphysystems

Users of this material who discover errors or who would like to make
comments on how to improve the material should send an e-mail to:
peter.marwedel@tu-dortmund.de

Due to the availability of this book and of videos, it is feasible and recommended 
to try out flipped classroom teaching [375]. With this style of teaching, students 
are requested to watch the videos (or read the book) at home. The presence of the 
students in the classroom is then used to interactively solve problems. This helps to 
strengthen problem-solving competences, team work, and social skills. In this way, 
the availability of the Internet is exploited to improve teaching methods for students 
actually present at their university. Assignments could use the information in this or 
in complementary books (e.g., [593], [81], and [174]). 

With flipped classroom teaching, existing lab session slots can be completely 
dedicated to gaining some practical experience with CPS. Toward this end, a course 
using this textbook should be complemented by an exciting lab, using, for example, 
small robots, such as Lego Mindstorms, or micro-controllers (e.g., Raspberry
Pi, Arduino, or Odroid). For micro-controller boards which are available on the
market, educational material is typically available. Another option is to let students 
gain some practical experience with finite state machine tools. Teaching from this 
book should be complemented by a course on machine learning (or data analysis) 
[188, 204, 453, 560], since the (possibly noisy) values returned by sensors must be 
interpreted. 


1.5.1 Prerequisites 


The book assumes a basic understanding of several areas:


• Computer programming (including foundations of software engineering and
some experience with programming of micro-controllers)

• Algorithms (graph algorithms, optimization algorithms, algorithm complexity)

• Computer organization, for example, at the level of the introductory book by J.L.
Hennessy and D.A. Patterson [212], including finite state automata

• Fundamentals of operating systems


Fig. 1.7 Positioning of the topics of this book (top row, prerequisites:
programming, algorithms, computer organization, OS & networks, math education,
EE fundamentals; center: embedded system fundamentals of cyber-physical and IoT
systems, together with courses for minor, data analysis, lab, project, and thesis;
bottom row, follow-up courses: control systems, digital signal processing, machine
vision, real-time systems, robotics, applications, middleware)


• Fundamentals of computer networks (important for IoT!)

• Fundamental mathematical concepts (tuples, integrals, and linear algebra)

• Electrical networks and fundamental digital circuits such as gates and registers


These prerequisites can be grouped into the courses in the top row of Fig. 1.7. 

Missing fundamental knowledge on electrical circuits, operational amplifiers,
memory management, and integer linear programming can be compensated for by
reading the appendices of this book. Knowledge of statistics and Fourier transforms
is welcome.


1.5.2 Recommended Additional Courses 


The book should be complemented by follow-up courses providing more special-
ized knowledge in some of the following areas (see the bottom row in Fig. 1.7):⁷


• Control systems

• Digital signal processing

• Machine vision

• Real-time systems, real-time operating systems, and scheduling

• Robotics

• Application areas such as telecommunications, automotive, medical equipment,
and smart homes

• Middleware

• Specification languages and models for embedded systems

• Sensors and actuators

• Dependability of computer systems

• Low-power design techniques

• Physical aspects of CPS

• Computer-aided design tools for application-specific hardware

• Formal verification of hardware systems

• Testing of hardware and software systems

• Performance evaluation of computer systems

• Ubiquitous computing

• Advanced communication techniques for IoT

• The Internet of Things (IoT)

• Impact of embedded, CPS, and IoT systems

• Legal aspects of embedded, CPS, and IoT systems

7 The partitioning between undergraduate courses and graduate courses may differ between
universities.


1.6 Design Flows 


The design of the considered systems is a rather complex task, which has to be 
broken down into a number of subtasks to be tractable. These subtasks must be 
performed one after the other and some of them must be repeated. 

The design information flow starts with ideas in people’s heads. These ideas 
should incorporate knowledge about the application area. They must be captured 
in a design specification. In addition, standard hardware and system software 
components are typically available and should be reused whenever possible (see 
Fig. 1.8). In Fig. 1.8 (as well as in other similar diagrams in this book), we are 
using boxes with rounded corners for stored information and rectangles for 
transformations on information. In particular, information is stored in the design 
repository. The repository allows keeping track of design models. In most cases, 
the repository should provide version management or “revision control,” such as 
CVS [87], SVN [108], or “git” (see https://www.git-scm.com). A good design 
repository should also come with a design management interface which would also 
keep track of the applicability of design tools and sequences, all integrated into 
a comfortable graphical user interface (GUI). The design repository and the GUI 
can be extended into an integrated development environment (IDE), also called 
design framework (see, e.g., [345]). An integrated development environment keeps 
track of dependencies between tools and design information. 

Using the repository, design decisions can be taken in an iterative fashion. At 
each step, design model information must be retrieved. This information is then 
considered. 

During design iterations, applications are mapped to execution platforms,
and new (partial) design information is generated. The generation comprises the
mapping of operations to concurrent tasks, the mapping of operations to either
hardware or software (called hardware/software partitioning), compilation, and
scheduling.

Fig. 1.8 Simplified design information flow

Designs should be evaluated with respect to various objectives including 
performance, dependability, energy consumption, and thermal behavior. At the 
current state of the art, usually none of the design steps can be guaranteed to be 
correct. Therefore, it is also necessary to validate the design. Validation consists of 
checking intermediate or final design descriptions against other descriptions. Thus, 
each design decision should be evaluated and validated. 

Due to the importance of the efficiency of embedded systems, optimizations 
are important. There are a large number of possible optimizations, including high- 
level transformations (such as advanced loop transformations) and energy-oriented 
optimizations. 

Design iterations could also include test generation and an evaluation of the 
testability. Testing needs to be included in the design iterations if testability issues 
are already considered during the design steps. In Fig. 1.8, test generation has been 
included as optional step of design iterations (see the dashed box). If test generation 
is not included in the iterations, it must be performed after the design has been 
completed. 

At the end of each step, the repository should be updated. Version support would 
be welcome. 

Details of the flow between the repository, application mapping, evaluation, vali- 
dation, optimization, testability considerations, and storage of design information 
may vary. These actions may be interleaved in many different ways, depending 
on the design methodology used. This book presents embedded system design 
from a broad perspective, and it is not tied to particular design flows or tools.
Therefore, we have not indicated a particular list of design steps. For any particular 
design environment, we can “unroll” the loop in Fig. 1.8 and attach names to 
particular design steps. 
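
The generic loop of Fig. 1.8 can be illustrated by a small sketch. It is not taken from any particular design framework: the class `Repository`, the function `design_iterations`, and the two example steps `partition` and `schedule` are hypothetical names chosen for illustration, and the evaluation and validation functions are deliberately trivial. The sketch only shows the order of actions described above: retrieve the model, apply a design step, evaluate, validate, and store a new version.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical design repository that keeps every stored model as a version."""
    versions: list = field(default_factory=list)

    def retrieve(self):
        # Return a copy of the most recent design model.
        return dict(self.versions[-1])

    def store(self, model):
        # Append a new version instead of overwriting the old one.
        self.versions.append(dict(model))


def design_iterations(repo, steps, evaluate, validate):
    """Apply each design step, then evaluate and validate before storing."""
    for step in steps:
        model = step(repo.retrieve())      # application mapping or refinement
        model["cost"] = evaluate(model)    # evaluation with respect to an objective
        if not validate(model):            # check against the earlier description
            raise ValueError(f"step {step.__name__} produced an invalid design")
        repo.store(model)                  # update the repository after each step
    return repo.retrieve()


def partition(model):
    # Toy hardware/software partitioning: operations named "fft..." go to hardware.
    model["hw"] = [op for op in model["ops"] if op.startswith("fft")]
    model["sw"] = [op for op in model["ops"] if op not in model["hw"]]
    return model


def schedule(model):
    # Toy scheduling: run the software operations in alphabetical order.
    model["schedule"] = sorted(model["sw"])
    return model


repo = Repository()
repo.store({"ops": ["fft1", "ctrl", "ui"]})
final = design_iterations(
    repo,
    [partition, schedule],
    evaluate=lambda m: len(m.get("hw", [])),
    validate=lambda m: set(m.get("hw", [])) | set(m.get("sw", [])) == set(m["ops"]),
)
```

"Unrolling" the loop for a concrete design environment then amounts to fixing the list of steps and giving each of them a name.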

For example, this leads to the particular case of the SpecC [173] design flow 
shown in Fig. 1.9. In this case, a particular set of design steps, such as architecture 
exploration, communication synthesis, and software and hardware compilation are 
included. The precise meaning of these terms is not relevant in this book. In the case
of Fig. 1.9, validation and evaluation are explicitly shown for each of the steps but
are wrapped into one larger box.

Fig. 1.10 Design flow for the V-model

A second instance of an unfolded Fig. 1.8 is shown in Fig. 1.10. It is the V-model 
of design flows [550], which has to be adhered to for many German IT projects. 

The model is used especially in the public sector but also beyond. Figure 1.10 
very clearly shows the different steps that must be performed. The steps correspond 
to certain phases during the software development process (the precise meaning is 
again not relevant in the context of this book). Note that taking design decisions 
and evaluating and validating designs are lumped into a single box in this diagram. 
Application knowledge, system software, and system hardware are not explicitly 
shown. The V-model also includes a model of the integration and testing phase
(right “wing” of the diagram). This corresponds to an inclusion of testing into the
integration phase. The shown model corresponds to the V-model version “97”. The
more recent V-model XT allows a more general set of design steps. This change
matches our interpretation of design flows in Fig. 1.8 very well. Other iterative
approaches include the waterfall model and the spiral model. More information 
about software engineering for embedded systems can be found in a book by J. 
Cooling [109]. 

Our generic design flow model is also consistent with flow models used in 
hardware design. For example, Gajski’s Y-chart [171] (see Fig. 1.11) is a very
popular model. Gajski considers design information in three dimensions: behavior,
structure, and layout. The first dimension just reflects the behavior. A high-level 
model would describe the overall behavior, while finer-grained models would 
describe the behavior of components. Models at the second dimension include 
structural information, such as information about hardware components. High- 
level descriptions in this dimension could correspond to processors and low-level 
descriptions to transistors. The third dimension represents geometrical layout infor- 
mation of chips. Design paths will typically start with a coarse-grained behavioral 
description and finish with a fine-grained geometrical description. Along this path, 
each step corresponds to one iteration of our generic design flow model. In the 
example of Fig. 1.11, an initial refinement is done in the behavioral domain. The 
second design step maps the behavior to structural elements and so on. Finally, a 
detailed geometrical description of the chip layout is obtained. 

The previous three diagrams demonstrate that a number of design flows are using 
the iterative flow of Fig. 1.8. The nature of the iterations in Fig. 1.8 can be a source 
of discussions. Ideally, we would like to describe the properties of our system and 
then let some smart tool do the rest. Automatic generation of design details is called 
synthesis. 


Definition 1.12 (Marwedel [370]) “Synthesis is the process of generating the 
description of a system in terms of related lower-level components from some high- 
level description of the expected behavior.” 


Automatic synthesis is assumed to perform this process automatically. Automatic 
synthesis, if successful, avoids many manual design steps. The goal of using 
automatic synthesis for the design of systems has been considered in the “describe- 
and-synthesize” paradigm by Gajski [172]. This paradigm is in contrast to the more 
traditional “specify-explore-refine” approach, also known as “design-and-simulate” 
approach. The second term stresses the fact that manual design typically has to be 
combined with simulation, for example, for catching design errors. In the traditional 
approach, simulation is more important than in automatic synthesis. 


1.7 Structure of This Book


Consistent with the design information flow shown above, this book is structured as 
follows: Chapter 2 provides an overview of specification techniques, languages, and 
models. Key hardware components of embedded systems and the cyphy-interface 
are presented in Chap. 3. Chapter 4 deals with system software components, partic- 
ularly embedded operating systems. Chapter 5 contains the essentials of embedded 
system design evaluation and verification. Mapping applications to execution 
platforms is one of the key steps in the design process of embedded systems. 
Standard techniques (including scheduling) for achieving such mapping are listed 
in Chap. 6. Due to the need for generating efficient designs, many optimization
techniques are needed. From among the abundant set of available optimization 
techniques, several groups are mentioned in Chap. 7. Chapter 8 contains a brief 
introduction to testing mixed hardware/software systems. The Appendix comprises 
prerequisites for understanding the book, and it can be skipped by students familiar 
with the topics covered there. 

It may be necessary to design special-purpose hardware or to optimize processor 
architectures for a given application. However, hardware design is not covered in 
this book. Coussy and Morawiec [113] provide an overview of high-level hardware 
synthesis techniques. 

The content of this book is different from the content of most other books on 
embedded systems or CPS design. Traditionally, the focus of many such books is on 
explaining the use of micro-controllers, including their memory, I/O, and interrupt 
structure. There are many such books [38, 175-177, 279, 317, 425]. We believe 
that, due to the increasing complexity of embedded and cyber-physical systems, 
this focus has to be extended to include at least different specification paradigms, 
fundamentals of hardware building blocks, the mapping of applications to execution 
platforms, as well as evaluation, validation, and optimization techniques. In the 
current book, we will be covering all these areas. The goal is to provide students 
with an introduction to embedded systems and CPS, enabling students to put the 
different areas into perspective. 

For further details, we recommend a number of sources (some of which have also 
been used in preparing this book): 


• Symposia dedicated to embedded/cyber-physical systems include the
Embedded Systems Week (see http://www.esweek.org) and the Cyber-Physical
Systems Week (see http://www.cpsweek.org).

• The web site of the virtual CPS Organization in the USA contains numerous links
to current projects and their results [115].

• The web page of a special interest group of ACM [9] focuses on embedded
systems.

• The web site of the European network of excellence on embedded and real-time
systems [25] also provides numerous links for the area.

• A book written by Edward Lee et al. also includes physical aspects of cyber-
physical systems [335].

• Approaches for embedded system education are covered in the Workshops on
Embedded Systems Education (WESE; see [89] for results from the workshop
held in 2018) and in proceedings of the first (and only) Workshop on CPS
Education [424].

• Other sources of information about embedded systems include books by Laplante
[322] and Vahid [552], the ARTIST road map [63], the “Embedded Systems
Handbook” [614], and books by Gajski et al. [174] and Popovici et al. [457].

• There are a large number of sources of information on specification languages.
These include earlier books by Young [609], Burns and Wellings [80], Bergé
[567], and de Micheli [124]. There is a huge amount of information on
languages such as SystemC [407], SpecC [173], and Java [71, 131, 574].

• Real-time scheduling is covered comprehensively in the books by Buttazzo [81],
by Krishna and Shin [310], and by Baruah et al. [41].

• Approaches for designing and using real-time operating systems (RTOSes) are
presented in a book by Kopetz [303].

• Robotics is an area that is closely linked to embedded and cyber-physical
systems. We recommend the book by Siciliano et al. [487] for information on
robotics.

• There are specialized books and articles on the Internet of Things [185, 192, 193].

• Languages and verification are covered in a book by Haubelt and Teich (in
German) [206].


1.8 Problems 


We suggest solving the following problems either at home or during a flipped 
classroom session [375]. 


1.1 Please list possible definitions of the term “embedded system”! 


1.2 How would you define the term “cyber-physical system” (CPS)? Do you
see any difference between the terms “embedded systems” and “cyber-physical 
systems”? 


1.3 What is the “Internet of Things” (IoT)? 
1.4 What is the goal of “Industry 4.0”? 
1.5 In which way does this book cover CPS and IoT design? 


1.6 In which application areas do you see opportunities for CPS and IoT systems? 
Where do you expect major changes caused by information technology? 


1.7 Use the sources available to you to demonstrate the importance of embedded 
systems! 


1.8 Which challenges must be overcome in order to fully take advantage of the 
opportunities? 

1.9 What is a hard timing constraint? What is a soft timing constraint?
1.10 What is the “Zeno effect”? 
1.11 What is adaptive sampling? 


1.12 Which objectives must be considered during the design of embedded and 
cyber-physical systems? 


1.13 Why are we interested in energy-aware computing? 


1.14 What are the main differences between PC-based applications and embed- 
ded/CPS applications? 


1.15 What is a reactive system? 
1.16 On which web sites do you find companion material for this book? 


1.17 Compare the curriculum of your educational program with the description 
of the curriculum in this introduction. Which prerequisites are missing in your 
program? Which advanced courses are available? 


1.18 What is flipped classroom teaching? 
1.19 How could we model design flows? 
1.20 What is the “V-model”?


1.21 How could we define the term “synthesis”? 


Chapter 2
Specifications and Modeling


How can we describe the system which we would like to design and how can we 
represent intermediate design information? Models and description techniques for 
initial specifications as well as for intermediate design information will be shown 
in this chapter. First of all, we will capture requirements for modeling techniques. 
Next, we will provide an overview of models of computation. This will be followed 
by a presentation of popular models of computations, in combination with examples 
of the corresponding languages. The presentation includes models for early design 
phases, automata-based models, data-flow, Petri nets, discrete event models, von 
Neumann languages, and abstraction levels for hardware modeling. Finally, we will 
compare different models of computation and present exercises. 


2.1 Requirements 


Consistent with the simplified design flow (see Fig. 1.8), we will first of all describe 
requirements and approaches for specifying embedded and cyber-physical systems. 
Specifications for such systems provide models of the system under design (SUD). 
Models can be defined as follows: 


Definition 2.1 (Jantsch [268]) “A model is a simplification of another entity, which 
can be a physical thing or another model. The model contains exactly those 
characteristics and properties of the modeled entity that are relevant for a given 
task. A model is minimal with respect to a task if it does not contain any other 
characteristics than those relevant for the task”. 

Models are described in languages. Languages should be capable of representing
the following features:¹


• Hierarchy: Humans generally cannot comprehend systems containing many
objects (states, components) having complex relations with each other. The 
description of all real-life systems needs more objects than humans can under- 
stand. Hierarchy (in combination with abstraction) is a key mechanism helping 
to solve this dilemma. Hierarchies can be introduced such that humans need to 
handle only a small number of objects at any time. 

There are two kinds of hierarchies: 


— Behavioral hierarchies: Behavioral hierarchies are hierarchies containing 
objects necessary to describe the system behavior. States, events, and output 
signals are examples of such objects. 

— Structural hierarchies: Structural hierarchies describe how systems are 
composed of physical components. 

For example, embedded systems can be comprised of components such 
as processors, memories, actuators, and sensors. Processors, in turn, include 
registers, multiplexers, and adders. Multiplexers are composed of gates. 
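
As a small illustration of a structural hierarchy, the example above can be written down as a nested data structure. This is only a sketch under our own assumptions: the component names follow the text, but the particular gates inside the multiplexer and the helper functions `children`, `depth`, and `count_all` are hypothetical. The point is that a designer inspecting one level sees only a handful of objects, although the complete system contains many more.

```python
# Each component maps the names of its sub-components to their own hierarchy.
# An empty dictionary marks a component that is not refined any further here.
SYSTEM = {
    "processor": {
        "registers": {},
        "multiplexer": {"and_gate": {}, "or_gate": {}, "inverter": {}},
        "adder": {},
    },
    "memory": {},
    "actuator": {},
    "sensor": {},
}


def children(component):
    """Objects visible when looking at one level of the hierarchy only."""
    return sorted(component)


def depth(component):
    """Number of hierarchy levels, counting the component itself."""
    return 1 + max((depth(sub) for sub in component.values()), default=0)


def count_all(component):
    """Total number of objects below this component, over all levels."""
    return sum(1 + count_all(sub) for sub in component.values())
```

With this structure, `children(SYSTEM)` lists four objects at the top level, while `count_all(SYSTEM)` counts ten objects over all levels.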


• Component-based design [489]: It must be “easy” to derive the behavior of a
system from the behavior of its components. If two components are connected, 
the resulting new behavior should be predictable. For example, suppose that 
we add another component (say, some GPS unit) to a car. The impact of the 
additional component on the overall behavior of the system (including buses, 
etc.) should be predictable. 

• Concurrency: Real-life systems are distributed, concurrent systems composed
of components. It is therefore necessary to be able to specify concurrency
conveniently. Unfortunately, humans are not very good at understanding con-
current systems, and many problems with real systems are actually a result of an
incomplete understanding of possible behaviors of concurrent systems.

• Synchronization and communication: Components must be able to com-
municate and to synchronize. Without communication, components could not
cooperate, and we would use each of them in isolation. It must also be possible 
to agree on the use of resources. For example, it is necessary to express mutual 
exclusion. 
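
Mutual exclusion can be illustrated with a short sketch using threads from the Python standard library. The class `SharedCounter` and the function `run_workers` are hypothetical names, and the shared resource is just an integer. The lock makes the read-modify-write sequence a critical section, so that concurrent increments cannot overwrite each other.

```python
import threading


class SharedCounter:
    """A shared resource whose update must not be interleaved."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute this critical section.
        with self._lock:
            current = self.value
            self.value = current + 1


def run_workers(counter, n_threads=4, n_increments=10_000):
    """Start several threads that all increment the same counter."""
    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

With the lock in place, `run_workers(SharedCounter())` always returns `n_threads * n_increments`. Without it, updates may be lost, depending on how the threads happen to be interleaved.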

• Timing behavior: Many embedded and cyber-physical systems are real-time
systems. Therefore, explicit timing requirements are one of the characteristics 
of such systems. The need for explicit modeling of time is very obvious 
from the term “cyber-physical system.” Time is one of the key dimensions in 
physics. Hence, timing requirements must be captured in the specification of 
embedded/cyber-physical systems. 


1 Information from the books of Burns et al. [80], Bergé et al. [567], and Gajski et al. [172] is used
in this list.

However, standard theories in computer science model time only in a very
abstract way. The O-notation is one of the examples.² This notation just reflects
growth rates of functions. It is frequently used to model run-times of algorithms, 
but it fails to describe real execution times. In physics, quantities have units, 
but the O-notation does not even have units. So, it would not distinguish between 
femtoseconds and centuries. A similar remark applies to termination properties of 
algorithms. Standard theories are concerned with proving that a certain algorithm 
eventually terminates. For real-time systems, we need to show that certain 
computations are completed in a given amount of time, but the algorithm as a 
whole should possibly run until power is turned off. 

According to Burns and Wellings [80], modeling time must be possible in the 
following four contexts: 


— Techniques for measuring elapsed time: 
For many applications, it is necessary to check how much time has elapsed 
since some computation was performed. Access to a timer would provide a 
mechanism for this. 

— Means for delaying processes³ for a specified time:
Typically, real-time languages provide some delay construct. Unfortunately,
typical implementations of embedded systems in software do not guarantee
precise delays. Let us assume that process τ should be delayed by some
amount Δ. Usually, this delay is implemented by changing τ’s state in
the operating system from “ready” or “run” to “suspended.” At the end of
this time interval, τ’s state is changed from “suspended” to “ready.” This does
not mean that the process actually executes. If some higher-priority task is
executing or if preemption is not used, the delay will be larger than Δ.

— Possibility to specify timeouts: 
There are many situations in which we must wait for a certain event to occur. 
However, this event may actually not occur within a given time interval, and 
we would like to be notified about this. For example, we might be waiting 
for a response from some network connection. We would like to be notified 
if this response is not received within some amount of time, say Δ. This
is the purpose of timeouts. Real-time languages usually also provide some 
timeout construct. Implementations of timeouts frequently come with the 
same problems which we mentioned for delays. 

— Methods for specifying deadlines and schedules: 
For many applications, it is necessary to complete certain computations in a 
limited amount of time. For example, if the sensors of some car signal an 
accident, airbags must be ignited within about 10 ms. In this context, we must 
guarantee that the software will decide whether or not to ignite the airbags 
in that given amount of time. The airbags could harm passengers if they go
off too late. Unfortunately, most languages do not allow timing constraints
to be specified. If they can be specified at all, they must be specified in separate
control files, pop-up menus, etc. But the situation is still bad even if we are
able to specify these constraints: many modern hardware platforms do not
have a very predictable timing behavior. Caches, stalled pipelines, speculative
execution, process preemption, interrupts, etc. may have an impact on the
execution time which is very difficult to predict. Accordingly, timing analysis
(verifying the timing constraints) is a very hard design task.

2 We assume that readers are familiar with this notation, as explained on p. 19.
3 Processes are programs currently being executed; see Definition 2.3.

Fig. 2.1 State diagram with exception k
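
The four contexts listed by Burns and Wellings can be made concrete with a small sketch based on the Python standard library. It is an illustration only: the function names are our own, Python on a general-purpose operating system offers no real-time guarantees, and `meets_deadline` merely checks a deadline after the fact instead of guaranteeing it in advance. The sketch does show, however, that a requested delay is only a lower bound for the delay actually observed.

```python
import queue
import time


def measure_elapsed(action):
    """Measuring elapsed time: read a timer before and after the computation."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def delay_overshoot(requested_s):
    """Delaying: return by how much the observed delay exceeds the requested one."""
    actual_s = measure_elapsed(lambda: time.sleep(requested_s))
    return actual_s - requested_s


def wait_for_response(channel, timeout_s):
    """Timeout: wait for a message, but give up after timeout_s seconds."""
    try:
        return channel.get(timeout=timeout_s)
    except queue.Empty:
        return None  # the caller is notified that nothing arrived in time


def meets_deadline(action, deadline_s):
    """Deadline: check afterwards whether the computation finished in time."""
    return measure_elapsed(action) <= deadline_s
```

On a loaded machine, `delay_overshoot(0.01)` can be considerably larger than zero, which mirrors the remark above that the process only becomes "ready" at the end of the delay and does not necessarily execute immediately.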


• State-oriented behavior: It was already mentioned in Chap. 1 on p. 17 that
automata provide a good mechanism for modeling reactive systems. Therefore, 
the state-oriented behavior provided by automata should be easy to describe. 
However, classical automata models are insufficient, since they cannot model 
timing and since hierarchy is not supported. 

• Event-handling: Due to the reactive nature of embedded systems, mechanisms
for describing events must exist. Such events may be external events (caused by 
the environment) or internal events (caused by components of the system under 
design). 

• Exception-oriented behavior: In many practical systems, exceptions do occur.
In order to design dependable systems, it must be possible to describe actions to 
handle exceptions easily. It is not acceptable that exceptions must be indicated 
for each and every state (such as in the case of classical state diagrams). 


Example 2.1 In Fig. 2.1, input k might correspond to an exception.

Specifying this exception at each state makes the diagram very complex. The 
situation would get worse for larger state diagrams with many transitions. On 
p. 52 we will show how all the transitions can be replaced by a single one (see 
Fig. 2.12). ∇
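
The effect described in Example 2.1 can be sketched with two transition functions. The state names A to E, the inputs f to j, and the target state Z are assumptions made for this illustration and need not match the figure exactly. In the flat version, the exception input k has to be listed once per state. In the hierarchical version, a single rule attached to a super-state covers all of its sub-states.

```python
# Regular transitions of the (assumed) state diagram: (state, input) -> next state.
NORMAL = {
    ("A", "f"): "B",
    ("B", "g"): "C",
    ("C", "h"): "D",
    ("D", "i"): "E",
    ("E", "j"): "A",
}

# States grouped into one super-state for the hierarchical variant.
SUPER_STATE = frozenset("ABCDE")

# Flat automaton: the exception transition is repeated for every single state.
FLAT = dict(NORMAL)
for state in SUPER_STATE:
    FLAT[(state, "k")] = "Z"


def step_flat(state, symbol):
    """Classical state diagram: look up the pair, stay in place if undefined."""
    return FLAT.get((state, symbol), state)


def step_hierarchical(state, symbol):
    """One exception rule for the whole super-state, then the regular rules."""
    if symbol == "k" and state in SUPER_STATE:
        return "Z"
    return NORMAL.get((state, symbol), state)
```

Both functions define the same behavior, but the flat table needs ten entries where the hierarchical description needs five regular transitions plus one exception rule. The gap widens with every additional state.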


• Presence of programming elements: Popular programming languages have
proven to be a convenient means of expressing computations that must be 
performed. Hence, programming language elements should be available in 
the specification technique used. Classical state diagrams do not meet this 
requirement. 

• Executability: Specifications are not automatically consistent with the ideas in
people’s heads. Executing the specification is a means of plausibility checking. 
Specifications using programming languages have a clear advantage in this 
context. 

• Support for the design of large systems: There is a trend toward large
and complex embedded software programs. Software technology has found 
mechanisms for designing such large systems. For example, object orientation 
is one such mechanism. It should be available in the specification methodology. 

• Domain-specific support: It would of course be nice if the same specification
technique could be applied to all the different types of embedded systems, since 
this would minimize the effort for developing specification techniques and tool 
support. However, due to the wide range of application domains including those 
listed in Sect. 1.2, there is little hope that one language can be used to efficiently 
represent specifications in all such domains. For example, control-dominated, 
data-dominated, centralized, and distributed application domains can all benefit 
from language features dedicated toward those domains. 

• Readability: Of course, specifications must be readable by humans. Otherwise,
it would not be feasible to validate whether or not the specification meets 
the real intent of the persons specifying the system under design. All design 
documents should also be machine-readable in order to process them in a 
computer. Therefore, specifications should be captured in languages which are 
readable by humans and by computers. 

Initially, such specifications could use a natural language such as English 
or Japanese. Even this natural language description should be captured in a 
design document, so that the final implementation can be checked against 
the original document. However, natural languages are not sufficient for later 
design phases, since natural languages lack key requirements for specification 
techniques: it is necessary to check specifications for completeness and absence 
of contradictions. Furthermore, it should be possible to derive implementations 
from the specification in a systematic way. Natural languages do not meet these 
requirements. 

• Portability and flexibility: Specifications should be independent of specific
hardware platforms so that they can be easily used for a variety of target 
platforms. Ideally, changing the hardware platform should have no impact on 
the specification. In practice, small changes may have to be tolerated. 

• Termination: It should be feasible to identify terminating processes from the
specification. This means that we would like to use specifications for which the 
halting problem (the problem of figuring out whether or not a certain algorithm 
will terminate; see, e.g., [494]) is decidable. 

• Support for non-standard I/O devices: Many embedded systems use I/O
devices other than those typically found in a PC. It should be possible to describe 
inputs and outputs for those devices conveniently. 

• Non-functional properties: Actual systems under design must exhibit a number
of non-functional properties, such as fault tolerance, size, extendability, expected 
lifetime, power consumption, weight, disposability, user-friendliness, and elec- 
tromagnetic compatibility (EMC). There is no hope that all these properties can 
be defined in a formal way. 

• Support for the design of dependable systems: Specification techniques should
provide support for designing dependable systems. For example, specification 
languages should have unambiguous semantics, facilitate formal verification, and 
be capable of describing security and safety requirements. 

• No obstacles to the generation of efficient implementations: Since embedded
systems must be efficient, no obstacles prohibiting the generation of efficient 
realizations should be present in the specification. 

• Appropriate model of computation (MoC): The von Neumann model of
sequential execution combined with some communication technique is a com- 
monly used MoC. In this model, specifications will typically consist of tasks, 
processes, or threads, which can be defined as follows: 


Definition 2.2 ([393]) A task is an “assigned piece of work often to be finished 
within a certain time”. 


In the context of embedded systems, tasks will typically correspond to computa- 
tions that have to be performed. 


Definition 2.3 ([525]) A process is a program being executed. 


A more precise definition will be provided in Definition 4.1. Sometimes, tasks are 
more abstract than processes. In this case, they have to be mapped to processes 
within an operating system. However, sometimes the terms “process” and “task” 
are used interchangeably. The term “thread” is very similar to the term “process.” 


Definition 2.4 A thread is a lightweight process. This means that switching 
between the execution of threads causes less overhead than switching between 
processes. Usually, threads can communicate with each other via shared memory. 


The term “thread” will be more precisely defined in Definition 4.2. 
The von Neumann model has a number of serious problems, in particular for 
embedded system applications. Problems include: 


— Facilities for describing timing are lacking. 

— von Neumann computing is implicitly based on accesses to globally shared 
memory (such as in Java). It has to guarantee mutually exclusive access to 
shared resources. Otherwise, multithreaded applications allowing preemptions 
at any time can lead to very unexpected program behaviors.⁴ Using 
primitives for ensuring mutually exclusive access can, however, very easily 
lead to deadlocks. Possible deadlocks may be difficult to detect and may 
remain undetected for many years. 


⁴ Examples are typically provided in courses on operating systems. 


Example 2.2 Edward Lee [331] provided a very alarming example in this 
direction. Lee studied implementations of a simple observer pattern in Java. 
For this pattern, changes of values must be propagated from some producer to 
a set of subscribed observers. This is a very frequent pattern in embedded 
systems but is difficult to implement correctly in a multithreaded von Neumann 
environment with preemptions. Lee’s code is a possible implementation of the 
observer pattern in Java for a multithreaded environment: 


public synchronized void addListener(listener) {...} 
public synchronized void setValue(newValue) { 
    myValue = newValue; 
    // notify every subscribed observer while still holding the lock 
    for (int i = 0; i < myListeners.length; i++) { 
        myListeners[i].valueChanged(newValue); 
    } 
} 


Method addListener subscribes new observers; method setValue propagates 
new values to subscribed observers. In general, in a multithreaded 
environment, threads can be preempted any time, resulting in an arbitrarily 
interleaved execution of these threads. Adding observers while setValue 
is already active could result in complications, i.e., we would not know if 
the new value had reached the new listener. Moreover, the set of observers 
constitutes a global data structure of this class. Therefore, these methods are 
synchronized in order to avoid changing the set of observers while values are 
already partially propagated. This way, only one of the two methods can be 
active at a given time. This mutual exclusion is necessary to prevent unwanted 
interleavings of the execution of methods in a multithreaded environment. 
Why is this code problematic? It is problematic since valueChanged could 
attempt to get exclusive access to some resource (say, R). If that resource is 
allocated to some other method (say, A), then this access is delayed until A 
releases R. If A calls (possibly indirectly) addListener or setValue before 
releasing R, then these methods will be in a deadlock: setValue waits for 
R; releasing R requires A to proceed; A cannot proceed before its call of 
setValue or addListener is serviced. Hence, we will have a deadlock. 

This example demonstrates the existence of deadlocks resulting from using 
multiple threads which can be arbitrarily preempted and therefore require 
mutual exclusion for their access to critical resources. Lee showed [331] that 
many of the proposed “solutions” of the problem are problematic themselves. 
So, even this very simple pattern is difficult to implement correctly in a multi- 
threaded von Neumann environment. This example shows that concurrency 
is really difficult to understand for humans and there may be the risk of 
oversights, even after very rigorous code inspections. ∇ 


Lee came to the drastic conclusion that “nontrivial software written with 
threads, semaphores, and mutexes is incomprehensible to humans” and that 
“threads as a concurrency model are a poor match for embedded systems. 
... they work well only ... where best-effort scheduling policies are sufficient” 
[330]. 

The underlying reasons for deadlocks have been studied in detail in the con- 
text of operating systems (see, e.g., [507]). From this context, it is well-known 
that four conditions must hold at run-time to get into a deadlock: mutual 
exclusion, no preemption of resources, holding resources while waiting for 
more, and a cyclic dependency between threads. All four conditions are met 
in the above example. The theory of operating systems provides no general 
way out of this problem. Rare deadlocks may be acceptable for a PC, but they 
are clearly unacceptable for a safety-critical system. 
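
The cyclic dependency in Example 2.2 can be made explicit with a wait-for graph. The following Python sketch is only an illustration under our own assumptions: the thread names `T_setValue` and `T_A` and the helper `has_cycle` are hypothetical and do not appear in Lee's code. An edge from one thread to another means that the first thread waits for a resource held by the second. A cycle in this graph corresponds to the fourth deadlock condition.

```python
def has_cycle(wait_for):
    """Return True if the directed wait-for graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on DFS stack, finished
    color = {node: WHITE for node in wait_for}

    def visit(node):
        color[node] = GREY
        for succ in wait_for.get(node, ()):
            if color.get(succ, WHITE) == GREY:
                return True               # back edge: cyclic wait
            if color.get(succ, WHITE) == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in wait_for)


# The thread executing setValue holds the observer lock and waits for R,
# which is held by A; A holds R and waits for the observer lock.
deadlocked = {"T_setValue": ["T_A"], "T_A": ["T_setValue"]}

# Assumed variant: A releases R before calling setValue, so A never waits
# while holding R.
safe = {"T_setValue": ["T_A"], "T_A": []}

print(has_cycle(deadlocked))
print(has_cycle(safe))
```

Such a check only detects a cyclic wait in a given snapshot; it does not remove the need to design the locking discipline carefully.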


We would like to specify our SUD such that we do not have to care about possible 
deadlocks. Therefore, it makes sense to study non-von Neumann MoCs avoiding 
this problem. We will study such MoCs from the next section onward. It will be 
shown that the observer pattern can be easily implemented in other MoCs. 


From the list of requirements, it is already obvious that there will not be any 
single formal language meeting all these requirements. Therefore, in practice, we 
must live with compromises and possibly also with a mixture of languages (each of 
which would be appropriate for describing a certain type of problem). The choice 
of the language used for an actual design will depend on the application domain and 
the environment in which the design has to be performed. In the following, we will 
present a survey of languages that can be used for actual designs. These languages 
will demonstrate the essential features of the corresponding MoC. 


2.2 Models of Computation 


Models of computation (MoCs) describe the mechanism assumed for performing 
computations. In the general case, we must consider systems comprising compo- 
nents. It is now common practice to strictly distinguish between the computations 
performed in the components and communication. This distinction paves the way 
for reusing components in different contexts and enables plug-and-play for system 
components. Accordingly, we define models of computation as follows [267- 
269, 329]: 


Definition 2.5 Models of computation (MoCs) define 


• Components and the organization of computations in components: Procedures, 
processes, functions, and finite state machines are possible components. 

• Communication protocols: These protocols describe methods for communication 
between components. Asynchronous message passing and rendezvous-based 
communication are examples of communication protocols. 


Relations between components can be captured in graphs. In such graphs, we 
will refer to the computations also as processes or tasks. Accordingly, relations 
between these will be captured by task graphs and process networks. Nodes in 
the graph represent components performing computations. Computations map input 
data streams to output data streams. Computations are sometimes implemented in 
high-level programming languages. Typical computations contain (possibly non- 
terminating) iterations. In each cycle of the iteration, they consume data from their 
inputs, process the data received, and generate data on their output streams. Edges 
represent relations between components. We will now introduce these graphs at a 
more detailed level. 

The most obvious relation between computations is their causal dependence: 
many computations can only be executed after other computations have terminated. 
This dependence is typically captured in dependence graphs. Figure 2.2 shows a 
dependence graph for a set of computations. 


Fig. 2.2 Dependence graph 


Definition 2.6 A dependence graph is a directed graph $G = (\tau, E)$, where $\tau$ is the 
set of vertices or nodes and $E$ is the set of edges. $E \subseteq \tau \times \tau$ imposes a relation on 
$\tau$. If $(\tau_1, \tau_2) \in E$ with $\tau_1, \tau_2 \in \tau$, then $\tau_1$ is called an immediate predecessor of $\tau_2$, 
and $\tau_2$ is called an immediate successor of $\tau_1$. Let $E^*$ be the transitive closure of $E$. 
If $(\tau_1, \tau_2) \in E^*$, then $\tau_1$ is called a predecessor of $\tau_2$, and $\tau_2$ is called a successor 
of $\tau_1$. 
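
To make the difference between immediate successors and successors concrete, the following Python sketch computes the transitive closure of a small edge set. The graph used here is hypothetical (it is not the graph of Fig. 2.2), and the function name `transitive_closure` is our own choice.

```python
def transitive_closure(edges):
    """Return E*, the transitive closure of the edge set E."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # (a, b) and (b, d) imply (a, d)
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical dependence graph with four nodes.
E = {("t1", "t2"), ("t2", "t3"), ("t1", "t4")}
E_star = transitive_closure(E)

immediate_successors = {b for (a, b) in E if a == "t1"}
successors = {b for (a, b) in E_star if a == "t1"}

print(sorted(immediate_successors))
print(sorted(successors))
```

In this sketch, t3 is a successor of t1 but not an immediate successor, since it is reachable only via t2.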


Such dependence graphs form a special case of task graphs. Task graphs may 
contain more information than modeled in Fig. 2.2. For example, task graphs may 
include the following extensions of dependence graphs: 


1. Timing information: Tasks may have arrival times, deadlines, periods, and 
execution times. In order to show them graphically, it may be useful to include 
this information in the graphs. However, we will indicate such information 
separately from the graphs in this book. 

2. Distinction between different types of relations between computations: Prece- 
dence relations just model constraints for possible execution sequences. At a 
more detailed level, it may be useful to distinguish between constraints for 
scheduling and communication between computations. Communication can also 
be described by edges, but additional information may be available for each of 
the edges, such as the time of the communication and the amount of information 
exchanged. Precedence edges may be kept as a separate type of edges, since there 
could be situations in which computations must execute sequentially even though 
they do not exchange information. 

In Fig. 2.2, input and output (I/O) are not explicitly described. Implicitly it 
is assumed that computations without any predecessor in the graph might be 
receiving input at some time. Also, they might generate output for the successor, 
and this output could be available only after the computation has terminated. It 
is often useful to describe input and output more explicitly. In order to do this, 
another kind of relation is required. Using the same symbols as Thoen [538], 
we use partially filled circles for denoting input and output. In Fig. 2.3, partially 
filled circles identify I/O edges. 

Fig. 2.3 Graph including I/O 

3. Exclusive access to resources: Computations may be requesting exclusive 
access to some resource, for example, to some input/output device or some 
communication area in memory. Information about necessary exclusive access 
should be taken into account during scheduling. Exploiting this information 
might, for example, be used to avoid the priority inversion problem (see p. 213). 
Information concerning exclusive access to resources can be included in the 
graphs. 

4. Periodic schedules: Many computations, especially in digital signal processing, 
are periodic. This means that we must distinguish more carefully between a task 
and its execution (the latter is frequently called a job [347]).⁵ Graphs for such 
schedules are infinite. Figure 2.4 shows a graph including jobs $J_{n-1}$ to $J_{n+1}$ of a 
periodic task. 

5. Hierarchical graph nodes: The complexity of the computations denoted by 
graph nodes may be quite different. On the one hand, specified computations may 
be quite involved and contain thousands of lines of program code. On the other 
hand, programs can be split into small pieces of code so that in the extreme 
case, each of the nodes corresponds only to a single operation. The graph node 
complexity is also called their granularity. Which granularity should be used? 
There is no universal answer to this. For some purposes, the granularity should 
be as large as possible. For example, if we consider each of the nodes as one 
process to be scheduled by a real-time operating system (RTOS), it may be wise 
to work with large nodes in order to minimize context switches between different 
processes. For other purposes, it may be better to work with nodes modeling just a 
single operation. For example, nodes must be mapped to hardware or to software. 
If a certain operation (such as the frequently used discrete cosine transform, or 
DCT) can be mapped to special-purpose hardware, then it should not be buried in 
a complex node that contains many other operations. It should rather be modeled 
as its own node. In order to avoid frequent changes of the granularity, hierarchical 
graph nodes are very useful. For example, at a high hierarchical level, the nodes 
may denote complex tasks, at a lower level basic blocks,⁶ and at an even lower 
level individual arithmetic operations. Figure 2.5 shows a hierarchical version of 
the dependence graph in Fig. 2.2, using a rectangle to denote a hierarchical node. 

Fig. 2.5 Hierarchical task 


⁵ This term will be defined more precisely in Definitions 4.4 and 6.1. 

⁶ Basic blocks are code blocks of maximum length not including any branch except possibly at 
their end and not being branched into. 
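
The remark in item 4 that graphs for periodic schedules are infinite can be illustrated with a generator. This is a minimal sketch under our own assumptions: the period of 10 time units and the function name `jobs` are hypothetical. Each execution (job) of the task is produced on demand, and only a finite window of jobs is ever inspected.

```python
from itertools import count, islice


def jobs(period, first_release=0):
    """Yield (job index, release time) for an unbounded periodic task."""
    for n in count():
        yield n, first_release + n * period


# Inspect three consecutive jobs of the (conceptually infinite) sequence.
window = list(islice(jobs(period=10), 3, 6))
print(window)
```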


As indicated above, MoCs can be classified according to the models of communi- 
cation (reflected by edges in the task graphs) and the model of computations within 
the components (reflected by the nodes in the task graphs). In the following, we will 
explain prominent examples of such models: 


• Models of communication: 

We distinguish between two communication paradigms: shared memory and 
message passing. Other communication paradigms exist (e.g., entangled states 
in quantum mechanics [62]), but are not considered in this book. 


Shared memory: For shared memory, communication is performed by 
accesses to the same memory from all components. Access to shared memory 
should be protected, unless access is restricted to reads. If writes are involved, 
exclusive access to the memory must be guaranteed while components are 
accessing shared memories. Segments of program code, for which exclusive 
access must be guaranteed, are called critical sections. Mechanisms for 
guaranteeing exclusive access to resources include semaphores, mutexes, 
conditional critical regions, monitors, and spin locks (see books on operating 
systems like Stallings [507]). Shared memory-based communication can be 
fast but is difficult to implement in multiprocessor systems without a common 
physical memory. 
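
A critical section protected by a mutex can be sketched as follows. The sketch uses Python's standard `threading` module. The shared counter, the number of threads, and the number of increments are assumptions chosen only for illustration.

```python
import threading

shared = {"counter": 0}          # memory accessed by all threads
lock = threading.Lock()          # mutex guarding the critical section


def worker(increments):
    for _ in range(increments):
        with lock:               # critical section: read-modify-write
            shared["counter"] += 1


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["counter"])
```

Since every update is performed while holding the lock, the final value equals the total number of increments. Without the lock, updates from different threads could interleave and some of them could be lost.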

Message passing: In this case, messages are sent and received. Message 
passing can be implemented easily even if no common memory is available. 
However, message passing is generally slower than shared memory-based 
communication. We distinguish between three kinds of message passing: 


Asynchronous message passing, also called non-blocking communi- 
cation: In asynchronous message passing, components communicate by 
sending messages through channels which can buffer the messages. The 
sender does not need to wait for the recipient to be ready to receive 
the message. In real life, this corresponds to sending a letter or an e- 
mail. A potential problem is the fact that messages must be stored and 
that message buffers can overflow. There are variations of this scheme, 
including communicating finite state machines (see p. 62) and data-flow 
models (see p. 68). 

Synchronous message passing or blocking communication, rendezvous-based 
communication: In synchronous message passing, available components 
communicate in atomic, instantaneous actions called rendezvous. 
The component reaching the point of communication first has to wait 
until the partner has also reached its point of communication. In real life, 
this corresponds to physical meetings or phone calls. There is no risk of 
overflows, but performance may suffer. Examples of languages following 
this model of computation include CSP (see p. 110) and Ada (see p. 111). 

Extended rendezvous, remote invocation: In this case, the sender is 
allowed to continue only after an acknowledgment has been received from 
the recipient. The recipient does not have to send this acknowledgment 
immediately after receiving the message but can do some preliminary 
checking before actually sending the acknowledgment. 
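
Two of these schemes can be sketched with Python's standard `queue` and `threading` modules. This is an illustration under stated assumptions, not a reference implementation: the buffer capacity of 2, the message names, and the use of `Queue.join()` and `Queue.task_done()` as the acknowledgment mechanism are our own choices.

```python
import queue
import threading

# Asynchronous message passing: the sender never waits, so a bounded
# buffer can overflow if the receiver does not keep up.
channel = queue.Queue(maxsize=2)      # assumed buffer capacity
overflowed = False
for msg in ("m1", "m2", "m3"):
    try:
        channel.put_nowait(msg)       # returns immediately
    except queue.Full:
        overflowed = True             # the third message does not fit

# Extended rendezvous: the sender continues only after the receiver
# has acknowledged the message.
mailbox = queue.Queue()
log = []


def receiver():
    msg = mailbox.get()               # wait for the message
    log.append("checked " + msg)      # preliminary checking
    mailbox.task_done()               # acknowledgment


def sender():
    mailbox.put("request")
    mailbox.join()                    # block until acknowledged
    log.append("sender continues")


threads = [threading.Thread(target=receiver), threading.Thread(target=sender)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(overflowed, log)
```

In the second part, the log entry of the receiver always precedes that of the sender, because `join()` returns only after `task_done()` has been called.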


• Organization of computations within the components: 


Differential equations: Differential equations are capable of modeling analog 
circuits and physical systems. Hence, they can find applications in cyber- 
physical system modeling. 

Finite state machines (FSMs): This model is based on the notion of a finite 
set of states, inputs, outputs, and transitions between states. Several of these 
machines may need to communicate, forming so-called communicating finite 
state machines (CFSMs). 

Data flow: In the data-flow model, the availability of data triggers the possible 
execution of operations. 

Discrete event model: In this model, there are events carrying a totally 
ordered time stamp, indicating the time at which the event occurs. Discrete 
event simulators typically contain a global event queue sorted by time. Entries 
from this queue are processed according to this order. The disadvantage is that 
this model relies on a global notion of event queues, making it difficult to map 
the semantic model onto parallel implementations. Examples include VHDL 
(see p. 98), SystemC (see p. 97), and Verilog (see p. 109). 

von Neumann model: This model is based on the sequential execution of 
sequences of primitive computations. 
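
The global event queue of the discrete event model can be sketched with Python's standard `heapq` module. The function name `simulate`, the event names, and the delays are hypothetical and serve only as an illustration. They are not taken from VHDL, SystemC, or Verilog simulators.

```python
import heapq


def simulate(initial_events, handlers, until):
    """Process time-stamped events in time order from one global queue."""
    event_queue = list(initial_events)    # entries: (time, seq, name)
    heapq.heapify(event_queue)
    seq = len(event_queue)                # tie-breaker for equal time stamps
    trace = []
    while event_queue and event_queue[0][0] <= until:
        time, _, name = heapq.heappop(event_queue)
        trace.append((time, name))
        # Each processed event may schedule new events in the future.
        for delay, new_name in handlers.get(name, ()):
            heapq.heappush(event_queue, (time + delay, seq, new_name))
            seq += 1
    return trace


# Assumed model: a clock event re-schedules itself every 5 time units
# and triggers a sample event 1 time unit later.
handlers = {"clock": [(5, "clock"), (1, "sample")]}
trace = simulate([(0, 0, "clock")], handlers, until=11)
print(trace)
```

All events pass through the single queue `event_queue`. This makes the sequential order easy to guarantee, but it is also the reason why mapping such models onto parallel implementations is difficult.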


• Combined models: Actual languages typically combine a certain model 
of communication with an organization of computations within components. For 
example, StateCharts (see p. 51) combines finite state machines with shared 
memories. SDL (see p. 62) combines finite state machines with asynchronous 
message passing. Ada (see p. 111) and CSP (see p. 111) combine von Neumann 
execution with synchronous message passing. Table 2.1 gives an overview of 
combined models most of which we will consider in this chapter. This table also 
includes examples of languages for many of the MoCs. 


Let us look at MoCs with a defined model for computations within compo- 
nents. For differential equations, Modelica [399], commercial languages such as 
Simulink® [533], and the extension VHDL-AMS [245] of the hardware description 
language VHDL are examples of languages. 

Table 2.1 Overview of MoCs and languages considered 

Communication/              | Shared memory      | Message passing 
organization of components  |                    | Synchronous        | Asynchronous 
----------------------------+--------------------+--------------------+------------------ 
Undefined components        | Plain text or graphics, use cases 
                            | (Message) sequence charts 
Differential equations      | Modelica, Simulink®, VHDL-AMS 
Communicating finite        | StateCharts        |                    | SDL 
state machines (CFSMs)      |                    |                    | 
Data flow                   | Scoreboarding,     |                    | Kahn networks, 
                            | Tomasulo algorithm |                    | SDF 
Petri nets                  | C/E nets, P/T nets, ... 
Discrete event (DE)         | VHDL, Verilog,     | (Only experimental systems) 
modelᵃ                      | SystemC            | Distributed DE in Ptolemy 
von Neumann                 | C, C++, Java       | C, C++, Java, ... with libraries 
model                       |                    | CSP, Ada           | 


ᵃ The classification of VHDL, Verilog, and SystemC is based on the implementation of these 
languages in simulators. Message passing can be modeled in these languages “on top” of the 
simulation kernel 


Scoreboarding and the Tomasulo algorithm are data flow-driven techniques for 
dynamically scheduling instructions in computer architectures. They are described 
in books on computer architecture (see, e.g., Hennessy and Patterson [211]) and not 
presented in this book. 

Some MoCs have advantages in certain application areas, while others have 
advantages in others. Choosing the “best” MoC for a certain application may be 
difficult. Being able to mix MoCs (such as in the Ptolemy framework [120, 460]) can 
be a way out of this dilemma. Also, models may be translated from one MoC into 
another one. Non-von Neumann models are frequently translated into von Neumann 
models. The distinction between the different models blurs if the translation 
between them is easy. 

Designs starting from non-von Neumann models are frequently called model- 
based designs [421]. The key idea of model-based design is to have some abstract 
model of the system under design (SUD). Properties of the SUD can then be studied 
at the level of this model, without having to care about software code. Software 
code is generated only after the behavior of the model has been studied in detail, 
and this software is generated automatically [477]. The term “model-based design” 
is usually associated with models of control systems, comprising traditional control 
system elements such as integrators, differentiators, etc. However, this view may be 
too restricted, since we could also start with abstract models of consumer systems. 

In the following, we will present different MoCs, using existing languages as 
examples for demonstrating their features. A related (but shorter) survey is provided 
by Edwards [147]. For a more comprehensive presentation, see [187]. 

The very first ideas about systems are frequently captured in a very informal way, 
possibly on paper. Frequently, only descriptions of the SUD in a natural language 
such as English or Japanese exist in the early phases of design projects. They 
typically use a very informal style. These descriptions should be captured 
in some machine-readable document. They should be encoded in the format of 
some word processor and stored by a tool managing design documents. A good 
tool would allow links between the requirements, a dependence analysis as well as 
version management. DOORS® [228] exemplifies such a tool. 


2.3.1 Use Cases 


For many applications, it is beneficial to envision potential usages of the SUD. This 
way, we can make sure that the final system performs as expected in the envisioned 
context. Usages are captured in use cases. Use cases describe possible applications 
of the SUD. Different notations for use cases could be used. 

Support for a systematic approach to early specification phases is the goal of the 
so-called UML™ standardization effort [166, 207, 432]. UML stands for “Unified 
Modeling Language.” UML was designed by leading software technology experts 
and is supported by commercial tools. UML primarily aims at the support of the 
software design process. UML provides a standardized form for use cases. 

For use cases, there is neither a precisely specified model of the computations 
nor a precisely specified model of the communication. It is frequently argued that 
this is done intentionally in order to avoid caring about too many details during the 
early design phases. Nevertheless, attempts have been made to define the semantics 
more formally. 


Example 2.3 Figure 2.6 shows some use cases for an answering machine.⁷ There 
are five use cases for the owner of the answering machine and one for potential 
callers. We have to make sure that all six use cases can be implemented correctly. ∇ 


Use cases identify different classes of users as well as the applications to be 
supported by the SUD. In this way, it is possible to capture expectations at a very 
high level. 


⁷ We assume that UML is covered in depth in a software engineering course included in the 
curriculum. Therefore, UML is only briefly discussed in this book. 

Fig. 2.6 Use case example 


2.3.2 (Message) Sequence Charts and Time/Distance Diagrams 


At a more detailed level, we might want to explicitly indicate the sequences of 
messages which must be exchanged between components in order to implement 
some use of the SUD. Sequence charts (SCs)—earlier called message sequence 
charts (MSCs)—provide a mechanism for this. Sequence charts use one dimension 
(usually the vertical dimension) of a two-dimensional chart to denote sequences 
and the second dimension to reflect the different communication components. SCs 
describe partial orders between message transmissions, and they display a possible 
behavior of a SUD. SCs are also standardized in UML. UML 2.0 has extended SCs 
with elements allowing a more detailed description than UML 1.0. 


Example 2.4 Figure 2.7 shows one of the use cases of the answering machine as an 
example. Dashed lines are so-called lifelines. Messages are assumed to be ordered 
according to their sequence along the lifeline. We assume that, in this example, all 
information is sent in the form of messages. Arrows used in this diagram denote 
asynchronous messages. This means several messages can be sent by a sender 
without waiting for the receipt to be confirmed. Boxes on top of lifelines represent 
active control at the corresponding component. In the example, the answering 
machine is waiting for the user to pick up the phone within a certain amount of 
time. If he or she fails to do so, the machine signals a pick-up itself and sends a 
welcome message to the caller. The caller is then supposed to leave a voice-mail 
message. Alternative sequences (e.g., an early termination of the call by the caller 
or the callee picking up the phone) are not shown. ∇ 


Complex control-dependent actions cannot be described by SCs. Other MoCs 
must be used for this. Frequently, certain preconditions must be met for a SC to 
apply. Such preconditions, a distinction between sequences which might happen and 
those which must happen, as well as other extensions are available in the so-called 
Live Sequence Charts [117]. 

Fig. 2.7 Answering machine in UML™ 

Time/distance diagrams (TDDs) are a commonly used variant of SCs. In 
time/distance diagrams, the vertical dimension reflects real time, not just sequence. 
In some cases, the horizontal dimension also models the real distance between the 
components. TDDs provide the right means for visualizing schedules of trains or 
buses. 


Example 2.5 Figure 2.8 exemplifies modeling a schedule of trains between Ams- 
terdam, Cologne, Brussels, and Paris using a TDD. Trains can run from either 
Amsterdam or Cologne to Paris via Brussels. Aachen is included as an intermediate 
stop between Cologne and Brussels. Vertical segments correspond to times spent 
at stations. For one of the trains, there is a timing overlap between the trains 
coming from Cologne and Amsterdam at Brussels. There is a second train which 
travels between Paris and Cologne which is not related to an Amsterdam train. 
This example and other examples can be simulated with the levi simulation software 
[498]. 

Fig. 2.9 Railway traffic displayed by a time/distance diagram (courtesy H. Brändli, IVT, ETH 
Zürich), ©ETH Zürich 


Example 2.6 A larger, more realistic example is shown in Fig. 2.9. This example 
[224] describes simulated Swiss railway traffic in the Lötschberg area. Different 
station names are shown along the horizontal lines. The vertical dimension reflects 
real time. Slow and fast trains can be distinguished by their slope in the graph. 
Slow trains are characterized by steep slopes, possibly also containing significant 
waiting time at the stations (vertical slopes). For fast trains, slopes are almost flat. 
Trains are stopping only at a subset of the stations. In the presented example, 
it is not known whether the timing overlap at stations happens coincidentally or 
whether some real synchronization for connecting trains is required. Furthermore, 
permissible deviations from the schedule (min/max timing behavior) are not visible. 
∇ 


SCs and TDDs are very frequently used in practice. For example, they are 
valuable for applications of the IoT. One of the key distinctions between SCs and 
TDDs is that SCs do not include any reference to real time. TDDs are appropriate 
means for representing typical schedules. However, SCs and TDDs both fail to 
provide information about necessary synchronization. 
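
The last point can be illustrated with a small Python sketch of the data contained in a TDD. All train names and times below are hypothetical and are not read from Fig. 2.8 or Fig. 2.9. The function `overlaps` is our own helper. It finds trains that are at a station at the same time, which is all that a TDD shows.

```python
# Hypothetical timetable: (train, station, arrival, departure) in minutes.
stops = [
    ("from Cologne", "Brussels", 100, 110),
    ("from Amsterdam", "Brussels", 105, 112),
    ("from Paris", "Brussels", 200, 205),
]


def overlaps(stops, station):
    """Return pairs of trains whose dwell times at `station` overlap."""
    at_station = [s for s in stops if s[1] == station]
    pairs = []
    for i, (t1, _, a1, d1) in enumerate(at_station):
        for (t2, _, a2, d2) in at_station[i + 1:]:
            if a1 <= d2 and a2 <= d1:      # closed intervals intersect
                pairs.append((t1, t2))
    return pairs


print(overlaps(stops, "Brussels"))
```

The data reveals that two trains overlap at the station, but nothing in it states whether one train must wait for the other. Such a synchronization requirement would have to be specified separately.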

UML was initially not designed for real-time applications. UML 2.0 includes 
timing diagrams as a special class of diagrams. Such diagrams enable referring to 
physical time, similar to TDDs. Also, certain UML “profiles” (see p. 121) allow 
additional annotations to refer to time [368]. 


2.3.3 Differential Equations 


Differential equations can be written in the language of mathematics. Inputs for 
design tools typically require certain variants of this language. We exemplify such a 
variant with Modelica [399], a language aiming at modeling cyber-physical systems. 
Modelica has graphical as well as textual forms. Using the graphical form, systems 
can be described as sets of interconnected blocks. Each block can be described by 
equations. Connections between blocks denote common variables in the sense of 
mathematics. The information about each block together with information about 
connections can be transformed into a global set of equations. This process is called 
flattening of the hierarchy. Just like in mathematics, equations (and connections) 
have a bidirectional meaning (in contrast to programming languages). 


Example 2.7 The following model⁸ represents the bouncing ball example of p. 11: 


model StickyBall 
  type Height = Real(unit = "m"); 
  type Velocity = Real(unit = "m/s"); 
  parameter Real s = 0.8 "Restitution"; 
  parameter Height h0 = 1.0 "Initial height"; 
  constant Velocity eps = 1e-3 "small velocity"; 
  Boolean stuck; 
  Height h; 
  Velocity v; 
initial equation 
  v = 0; 
  h = h0; 
  stuck = false; 
equation 
  v = der(h); 
  der(v) = if stuck then 0 else -9.81; 
  when h <= 0.0 then 
    stuck = abs(v) < eps; 
    reinit(v, if stuck then 0 else -s*v); 
  end when; 
end StickyBall; 


⁸ This model has been derived from a model published by M. Tiller [541]. 

In the equations part, the velocity v is defined as the derivative of the height h. 
The derivative of v (the acceleration) is set to standard gravity (−9.81), unless 
the ball is already sticking to the surface. Equations have a bidirectional meaning. 
For this set of equations, there are boundary conditions defined in the initial 
equation part. Mathematical equations can be integrated numerically. This pro- 
cedure is exploited in the description of the bouncing: when clauses can be used to 
define events which happen while solving the equations. In the particular example, 
an event is generated when the height becomes less than or equal to zero. Whenever this 
event is generated while the velocity is still sufficiently large, the velocity is inverted 
and reduced by a factor of s, called restitution. The reinit clause effectively 
defines another boundary condition. 

However, if the velocity is smaller than eps, the ball is assumed to become sticky, 
and the velocity is set to zero, suppressing all future activities. The resulting model 
can be simulated, for example, with OpenModelica.⁹ 

After being released, the ball travels at a speed and a distance as shown in the 
mathematical background below: 


v= gt (2.1) 
§ 2 

= >t 2.2 

S (2.2) 


This stops when the ball reaches the bottom ($x = h_0$). We call this partially 
elastic collision 0 (or bounce 0), the corresponding time $t_0$, and the corresponding 
velocity $v_0$. From Eqs. (2.1) and (2.2), we compute 


\[ v_0 = g\,t_0 \tag{2.3} \]

\[ h_0 = \frac{g}{2}\,t_0^2 \tag{2.4} \]

and, hence 

\[ t_0 = \frac{v_0}{g} \tag{2.5} \]

\[ t_0 = \sqrt{\frac{2}{g}\,h_0} \tag{2.6} \]

\[ v_0 = \sqrt{2\,g\,h_0} \tag{2.7} \]


After bouncing, the ball travels at speed 

\[ v = -s\,v_0 + g\,t \tag{2.8} \]

(with $t$ now measured from the time of bounce 0) until the velocity becomes 0. 
Let this happen at time $t_1'$. Equation (2.8) leads to 

\[ 0 = -s\,v_0 + g\,t_1' \]

\[ t_1' = s\,\frac{v_0}{g} \tag{2.9} \]

⁹ See https://openmodelica.org/. 


Compared to Eq. (2.5), the partially elastic collision has reduced the trip time by 
a factor of $s$. Next, the ball drops again, traveling downward as long as it traveled 
upward. Hence, the next collision (bounce 1) happens 

\[ t_1 = 2\,t_1' = 2\,s\,\frac{v_0}{g} \tag{2.10} \]

time units after the initial bounce. In each direction, trip times for bounce 1 are 
shorter by a factor of $s$ compared to the time for bounce 0. The same shortening of 
times will happen for the other bounces. Hence, bounce $n$ happens at time 

\[ t_n = \frac{v_0}{g} + \frac{2\,v_0}{g} \sum_{k=1}^{n} s^k = \frac{2\,v_0}{g} \sum_{k=0}^{n} s^k - \frac{v_0}{g} \tag{2.11} \]


As long as $s < 1$, this (geometric) series converges to 

\[ \lim_{n \to \infty} t_n = \frac{2\,v_0}{g} \sum_{k=0}^{\infty} s^k - \frac{v_0}{g} = \frac{2\,v_0}{g} \cdot \frac{1}{1-s} - \frac{v_0}{g} \tag{2.12} \]

This means that there is an upper bound on the time for the bounces, but not on 
the number of bounces. This corresponds to the fact that, mathematically speaking, 
an infinite series may converge to a finite value.¹⁰ 
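
The closed forms in Eqs. (2.11) and (2.12) can be checked numerically. The following Python sketch is only an illustration: the parameter values (g = 9.81, h0 = 1.0, s = 0.8) and the function names are assumptions made for this example and are not part of the Modelica model.

```python
import math


def bounce_time(n, h0=1.0, s=0.8, g=9.81):
    """Time of bounce n according to Eq. (2.11)."""
    v0 = math.sqrt(2.0 * g * h0)   # Eq. (2.7)
    t0 = v0 / g                    # Eq. (2.5)
    return t0 + 2.0 * t0 * sum(s**k for k in range(1, n + 1))


def bounce_time_limit(h0=1.0, s=0.8, g=9.81):
    """Upper bound on all bounce times for s < 1 according to Eq. (2.12)."""
    v0 = math.sqrt(2.0 * g * h0)
    return 2.0 * v0 / g / (1.0 - s) - v0 / g


if __name__ == "__main__":
    for n in (0, 1, 10, 100):
        print(n, bounce_time(n))
    print("limit", bounce_time_limit())
```

The bounce times grow with n but never exceed the limit, which is the behavior that makes event detection difficult for a numerical solver: infinitely many events accumulate in a finite time interval.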

Using sets of equations involving derivatives in Modelica brings us close to 
the language of mathematics and physics. However, events introduce sequential 
behavior. The implicit numerical integration procedure also introduces the hazard 
of numerical precision problems. In fact, already the test h <= 0.0 reflects that 
we might miss the case of h being exactly 0. Another hazard is present in the 
published model for the non-sticky ball [541]: numerical precision problems result 
in an OpenModelica solution for which the ball penetrates the floor for large times 
t. This problem is caused by not generating events if the time distance between 
bounces is too small. 

This example demonstrates very nicely the advantages and limitations of Modelica: 
on the one hand, it is feasible to describe even the physical part of cyber-physical 
systems. On the other hand, we are not exactly using the language of mathematics, 
and in this way, we are introducing modeling hazards. ∇ 

¹⁰ Note the link to the paradox of Achilles and the turtle [585]. 


2.4 Communicating Finite State Machines (CFSMs) 


In the following sections, we will consider the design of digital systems only. 
Compared to early design phases, we need more precise models of our SUD. We 
mentioned already on p. 17 and on p. 32 that we need to describe state-oriented 
behavior. State diagrams are a classical means of doing this. Figure 2.10 (the same 
as Fig. 2.1) shows an example of a state diagram, representing a finite state machine 
(FSM). 

Circles denote states. We will consider FSMs for which only one of their states 
is active. Such FSMs are called deterministic FSMs. Edges denote state transitions. 
Edge labels represent events. Let us assume that a certain state of the FSM is active 
and that an event happens which corresponds to one of the outgoing edges for the 
active state. Then, the FSM will change its state from the currently active state to 
the one indicated by the edge. FSMs may be implicitly clocked. Such FSMs are 
called synchronous FSMs. For synchronous FSMs, state changes will happen only 
at clock transitions. FSMs may also generate output (not shown in Fig. 2.10). For 
more information about classical FSMs, refer to, for example, Kohavi et al. [301]. 
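
The behavior described above can be sketched in a few lines of Python. The transition table below is hypothetical: its states and events are made up for illustration and do not reproduce Fig. 2.10. Ignoring events without a matching outgoing edge is also an assumption of this sketch.

```python
# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("A", "f"): "B",
    ("B", "g"): "C",
    ("C", "h"): "A",
}


def step(state, event):
    """Return the next state; events without a matching edge are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(initial, events):
    """Feed a sequence of events to the FSM and return the visited states."""
    trace = [initial]
    for event in events:
        trace.append(step(trace[-1], event))
    return trace


if __name__ == "__main__":
    print(run("A", ["f", "x", "g", "h"]))  # ['A', 'B', 'B', 'C', 'A']
```

Since the table maps each pair of state and event to exactly one successor, exactly one state is active at any time.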


2.4.1 Timed Automata 


Classical FSMs do not provide information about time. In order to model time, 
classical automata have been extended to also include timing information. Timed 
automata are essentially automata extended with real-valued variables. “The vari- 
ables model the logical clocks in the system, that are initialized with zero when 
the system is started, and then increase synchronously with the same rate. Clock 
constraints, i.e., guards on edges, are used to restrict the behavior of the automaton. 
A transition represented by an edge can be taken when the clocks’ values satisfy 
the guard labeled on the edge. Clocks may be reset to zero when a transition is 
taken” [45]. 


Example 2.8 Figure 2.11 shows the state diagram of an answering machine. The 
machine is usually in the initial state on the left. Whenever a ring signal is received, 
clock x is reset to 0, and a transition into a waiting state is made. If the called person 
lifts off the handset, talking can take place until the handset is returned. 

Otherwise, a transition to state play text can take place if time has reached a value 
of 4. Once the transition has taken place, a recorded message is played and this phase is 
terminated with a beep. Clock y ensures that this beep lasts at least one time unit. 
After the beep, clock x is reset to 0 again and the answering machine is ready for 
recording. If time has reached a value of 8 or if the caller remains silent, the next 
beep is played. This second beep again lasts at least one time unit. After the second 
beep, a transition is made into the final state. In this example, transitions are either 
caused by inputs (such as lift-off) or by so-called clock constraints. ∇ 


Fig. 2.10 State diagram 

Fig. 2.11 Servicing an incoming line in an answering machine 


Clock constraints describe transitions which can take place, but they do not have 
to. In order to make sure that transitions actually take place, additional location 
invariants can be defined. Location invariants x <= 5, x <= 9, and y <= 2 are 
used in the example such that transitions will take place no later than one time unit 
after the enabling condition became true. Using two clocks is for demonstration 
purposes only; a single clock would be sufficient. 

Formally speaking, timed automata can be defined as follows [45]: Let $C$ be a set 
of real-valued, non-negative variables representing clocks. Let $\Sigma$ be a finite alphabet 
of possible inputs. 


Definition 2.7 A clock constraint is a conjunctive formula of atomic constraints 
of the form $x \circ n$ or $(x - y) \circ n$ for $x, y \in C$, 
$\circ \in \{\leq, <, =, >, \geq\}$, and $n \in \mathbb{N}$. 


Note that constants n used in the constraints must be integers, even though clocks 
are real-valued. An extension to rational constants would be easy, since they could 
be turned into integers with simple multiplications. Let B(C) be the set of clock 
constraints. 


Definition 2.8 (Bengtsson [45]) A timed automaton is a tuple $(S, s_0, E, I)$ 
where: 


• $S$ is a finite set of states. 

• $s_0$ is the initial state. 

• $E \subseteq S \times B(C) \times \Sigma \times 2^C \times S$ is the set of edges. $B(C)$ models the 
conjunctive condition which must hold and $\Sigma$ models the input which is required for a 
transition to be enabled. $2^C$ reflects the set of clock variables which are reset 
whenever the transition takes place. 

• $I: S \to B(C)$ is the set of invariants for each of the states. $B(C)$ represents the 
invariant which must hold for a particular state $s \in S$. This invariant is described as a 
conjunctive formula. 
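
To make Definition 2.8 more concrete, the tuple can be written down as a small data structure. The following Python sketch is a simplification under stated assumptions: it supports only atomic constraints that compare a single clock with a constant (no clock differences), it checks invariants only at the end of a delay (which is sufficient for upper bounds such as x <= 5), and the state names, the input symbol "timeout", and all class and method names are hypothetical.

```python
import operator
from dataclasses import dataclass

OPS = {"<=": operator.le, "<": operator.lt, "==": operator.eq,
       ">": operator.gt, ">=": operator.ge}


def holds(constraint, clocks):
    """Evaluate a conjunction of atomic constraints (clock, op, n)."""
    return all(OPS[op](clocks[c], n) for c, op, n in constraint)


@dataclass(frozen=True)
class Edge:
    source: str
    guard: tuple        # element of B(C)
    symbol: str         # element of the input alphabet
    resets: frozenset   # element of 2^C: clocks reset by this edge
    target: str


@dataclass
class TimedAutomaton:
    states: set
    initial: str
    edges: list
    invariants: dict    # I: S -> B(C); missing entries mean "true"

    def delay(self, state, clocks, d):
        """Let d time units pass; None if the invariant would be violated."""
        new = {c: v + d for c, v in clocks.items()}
        return new if holds(self.invariants.get(state, ()), new) else None

    def fire(self, state, clocks, symbol):
        """Take the first enabled edge for the given input, if any."""
        for e in self.edges:
            if e.source == state and e.symbol == symbol and holds(e.guard, clocks):
                new = {c: (0.0 if c in e.resets else v) for c, v in clocks.items()}
                if holds(self.invariants.get(e.target, ()), new):
                    return e.target, new
        return None


if __name__ == "__main__":
    ta = TimedAutomaton(
        states={"wait", "play"},
        initial="wait",
        edges=[Edge("wait", (("x", ">=", 4),), "timeout", frozenset({"x"}), "play")],
        invariants={"wait": (("x", "<=", 5),)},
    )
    print(ta.fire("wait", {"x": 0.0}, "timeout"))  # None: guard not satisfied
    print(ta.fire("wait", {"x": 4.5}, "timeout"))  # ('play', {'x': 0.0})
    print(ta.delay("wait", {"x": 0.0}, 6.0))       # None: invariant violated
```

The guard states when the edge may be taken, whereas the invariant of the source state limits how long the automaton may stay there. Together they enforce that the transition happens within a time window.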


This first definition is usually extended to allow parallel compositions of timed 
automata. Timed automata having a large number of clocks tend to be difficult to 
understand. More details about timed automata can be found, for example, in papers 
by Dill et al. [133] and Bengtsson et al. [45]. 

Simulation and verification of timed automata is possible with the popular tool 
UPPAAL.¹¹ UPPAAL supports concurrency and data variables. 

Timed automata extend classical automata with timing information. However, 
many of our requirements for specification techniques are not met by timed 
automata. In particular, in their standard form, they do not provide hierarchy and 
concurrency. 


2.4.2 StateCharts: Implicit Shared Memory Communication 


The StateCharts language is presented here as a very prominent example of 
a language based on automata and supporting hierarchical models as well as 
concurrency. It does include a limited way of specifying timing. 

The StateCharts language was introduced by David Harel [203] in 1987 and later 
described more precisely in [141]. According to Harel, the name was chosen since 
it was “the only unused combination of flow or state with diagram or chart”. 


Modeling of Hierarchy 


The StateCharts language describes extended FSMs. Due to this, they can be used 
for modeling state-oriented behavior. The key extension is hierarchy. Hierarchy is 
introduced by means of superstates. 


Definition 2.9 States comprising other states are called superstates. 


Definition 2.10 States included in superstates are called substates of the super- 
states. 


Example 2.9 The StateCharts diagram in Fig. 2.12 is a hierarchical version of the 
diagram in Fig. 2.10. Superstate S includes states A, B, C, D, and E. 

Suppose the FSM is in state Z (Z will also be called an active state). Now, if 
input m is applied to the FSM, then A and S will be the new active states. If the 
FSM is in S and input k is applied, then Z will be the new active state, regardless 
of whether the FSM is in substates A, B, C, D, or E of S. In this example, all states 
contained in S are non-hierarchical states. ∇ 


¹¹ See http://www.uppaal.org for the academic and http://www.uppaal.com for the commercial 
version. 

Fig. 2.12 Hierarchical state diagram 


In general, substates of S could again be superstates consisting of substates them- 
selves. Also, whenever a substate of some superstate is active, the superstate is 
active as well. 


Definition 2.11 States which are not composed of other states are called basic 
states. 


The FSM of Fig. 2.12 can only be in one of the substates of superstate S at any 
time. Superstates of this type are called OR-superstates.¹² 


Definition 2.12 Superstate S is called an OR-superstate if the system comprising 
S is in exactly one substate of S whenever it is in S. 


In Fig. 2.12, k might correspond to an exception for which state S has to be left. 
The example already shows that the hierarchy introduced in StateCharts enables a 
compact representation of exceptions. 
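
The compact representation of exceptions can be illustrated with a small Python sketch. It assumes a hypothetical hierarchy and a hypothetical edge table that only loosely follow the description of Fig. 2.12. A single edge labeled k at superstate S replaces one edge per substate. The order in which nesting levels are searched is an arbitrary choice of this sketch; it does not matter here because no event labels edges at two different levels.

```python
# Hypothetical hierarchy: superstate S contains A..E; Z is outside of S.
PARENT = {"A": "S", "B": "S", "C": "S", "D": "S", "E": "S"}

# Hypothetical edges; they may start at basic states or at superstates.
EDGES = {
    ("A", "f"): "B",
    ("B", "g"): "C",
    ("S", "k"): "Z",   # exception: leaves S from any of its substates
    ("Z", "m"): "A",
}


def active_states(basic):
    """A basic state and all of its enclosing superstates are active."""
    chain = [basic]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def step(basic, event):
    """Search the active states for an edge labeled with the event."""
    for state in active_states(basic):
        if (state, event) in EDGES:
            return EDGES[(state, event)]
    return basic


if __name__ == "__main__":
    print(active_states("C"))   # ['C', 'S']
    print(step("C", "k"))       # Z
    print(step("Z", "m"))       # A
```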

StateCharts allows hierarchical descriptions of systems in which a system 
description comprises descriptions of subsystems which, in turn, may contain 
descriptions of subsystems. The hierarchy of the entire system can be represented 
by a tree. The root of the tree corresponds to the system as a whole, and all inner 
nodes correspond to hierarchical descriptions (called super-nodes in StateCharts). 
The leaves of the hierarchy are non-hierarchical descriptions (called basic states in 
StateCharts). 

So far, we have used explicit, direct edges to basic states to indicate the next 
state. With this approach, the internal structure of superstates cannot be hidden from 
the environment. In a true hierarchical environment, we should be able to hide the 
internal structure so that it can be described later or changed later without affecting 
the environment. This is possible with other mechanisms for describing the next 
state. 

The first additional mechanism is the default state mechanism. It can be used 
in superstates to indicate the particular substates that will become active if the 
superstates become active. In diagrams, default states are identified by edges starting 
at small filled circles. 


¹² More precisely, they should be called XOR-superstates, since the FSM is in either A, B, C, D, 
or E. However, this name is not commonly used in the literature. 

Fig. 2.13 State diagram using the default state mechanism 

Fig. 2.14 State diagram using the history and the default state mechanism 

Fig. 2.15 Combining the symbols for the history and the default state mechanism 


Example 2.10 Figure 2.13 shows a state diagram using the default state mechanism. 
The diagram is equivalent to Fig. 2.12. The filled circle itself is not a state. ∇ 


Another mechanism for specifying next states is the history mechanism. With 
this mechanism, it is possible to return to the last substate that was active before a 
superstate was left. The history mechanism is symbolized by a circle containing the 
letter H. Do not confuse circles comprising this letter with states! We will be using 
a different font for states and the history mechanism in order to reduce the risk of 
confusion. In order to define the next state for the initial transition into a superstate, 
the history mechanism is frequently combined with the default mechanism. 


Example 2.11 Consider the state diagram in Fig. 2.14. The behavior of the FSM 
is now somewhat different. If we input m while the system is in Z, then the 
FSM will enter A if this is the very first time we enter S, and otherwise it will 
enter the last state that we were in before leaving S. This mechanism has many 
applications. For example, if k denotes an exception, we could use input m to 
return to the state we were in before the exception. States A, B, C, D, and E 
could also call Z like a procedure. After completing “procedure” Z, we would 
return to the calling state. In this way, we are adding elements of programming 
languages to StateCharts. Figure 2.14 can also be redrawn as shown in Fig. 2.15. 
In this case, the symbols for the default and the history mechanism are combined. ∇ 
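
The interplay of the default and the history mechanism can be sketched as follows. The class and its method names are hypothetical and serve only to illustrate the behavior described in Example 2.11: the first entry uses the default state, and later entries resume in the substate that was active when the superstate was left.

```python
class OrSuperstate:
    """OR-superstate with a default state and a history connector (sketch)."""

    def __init__(self, substates, default):
        self.substates = set(substates)
        self.default = default
        self.last = None      # no history before the first entry
        self.current = None   # None while the superstate is inactive

    def enter(self):
        """Enter via the history connector; fall back to the default state."""
        self.current = self.last if self.last is not None else self.default
        return self.current

    def move(self, substate):
        """Internal transition between substates."""
        assert self.current is not None and substate in self.substates
        self.current = substate

    def leave(self):
        """Leave the superstate and remember the last active substate."""
        self.last, self.current = self.current, None


if __name__ == "__main__":
    s = OrSuperstate("ABCDE", default="A")
    print(s.enter())   # A (very first entry: default state)
    s.move("C")
    s.leave()          # e.g., caused by exception k
    print(s.enter())   # C (history: last active substate)
```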
Fig. 2.16 Answering machine 


Specification techniques must be able to describe concurrency conveniently. For 
this, StateCharts provides a second class of superstates, so-called AND-superstates. 


Definition 2.13 Superstates S are called AND-superstates if the system containing 
S will be in all of the substates of S whenever it is in S. 


Example 2.12 An AND-superstate is included in the answering machine example 
shown in Fig. 2.16. An answering machine normally performs two tasks concur- 
rently: it is monitoring the line for incoming calls and the keys for user input. In 
Fig. 2.16, the corresponding states are called Lwait and Kwait. Incoming calls are 
processed in state Lproc, while the response to pressed keys is generated in state 
Kproc. State Lproc is left whenever the caller hangs up the phone. Returning to 
state Lwait due to call termination by the owner is not modeled. Hence, this model 
provides no protection against stalking. 

For the time being, we assume that the on/off switch (generating events key-off 
and key-on) is decoded separately and pushing it does not result in entering Kproc. 
If the machine is switched off, the line monitoring state and the key monitoring state 
are left and reentered only if the machine is switched on. At that time, default states 
Lwait and Kwait are entered. While switched on, the machine will always be in the 
line monitoring state as well as in the key monitoring state. ∇ 


For AND-superstates, the substates entered as a result of entering the superstate 
can be defined independently. There can be any combination of history, default 
and explicit transitions. It is crucial to understand that all substates will always 
be entered, even if there is just one explicit transition to one of the substates. 
Accordingly, transitions out of an AND-superstate will always result in leaving all 
the substates. 
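
The entry and exit rules for AND-superstates can be summarized in a short Python sketch. The class, its method names, and the region names "line" and "key" are hypothetical; the substate names follow the answering machine example. History connectors are omitted for brevity.

```python
class AndSuperstate:
    """AND-superstate: all regions are active whenever the superstate is."""

    def __init__(self, defaults):
        self.defaults = dict(defaults)   # region name -> default substate
        self.active = {}                 # empty while the superstate is inactive

    def enter(self, explicit=None):
        """Enter all regions; an explicit target overrides only its own region."""
        self.active = dict(self.defaults)
        self.active.update(explicit or {})
        return self.active

    def leave(self):
        """A transition out of the superstate leaves all regions at once."""
        self.active = {}


if __name__ == "__main__":
    on = AndSuperstate({"line": "Lwait", "key": "Kwait"})
    print(on.enter())                   # {'line': 'Lwait', 'key': 'Kwait'}
    print(on.enter({"key": "Kproc"}))   # {'line': 'Lwait', 'key': 'Kproc'}
    on.leave()
    print(on.active)                    # {}
```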
Fig. 2.17 Answering machine with modified on/off switch processing 


Fig. 2.18 Timer in 
StateCharts 


Example 2.13 Let us modify our answering machine such that the 
on/off switch, like all other switches, is decoded in state Kproc (see Fig. 2.17). 

If pushing that key is detected in Kwait, transitions are assumed first into state 
Kproc and then into the off state. The second transition results in leaving the 
line-monitoring state as well. Switching the machine on again results in also entering the 
line-monitoring state. ∇ 


AND-superstates provide the key mechanism for describing concurrency in 
StateCharts. Each substate can be considered a state machine by itself. These 
machines are communicating with each other, forming communicating finite state 
machines (CFSMs). This term has been used as the title of this section. 

Summarizing, we can state the following: states in StateCharts diagrams are 
either AND-superstates, OR-superstates, or basic states. 


Timers 


Due to the requirement to model time in embedded systems, StateCharts also 
provides timers. Timers are denoted by the symbol shown in Fig. 2.18 on the left. 
After the system has been in the state containing the timer for the specified time, 
a timeout will occur, and the system will leave the specified state. Timers can also 
be used hierarchically. 
Timers can be employed, for example, at the next lower level of the hierarchy of 
the answering machine in order to describe the behavior of state Lproc. Figure 2.19 
shows a possible behavior for that state. The timing specification is slightly different 
from the one in Fig. 2.11. 

Due to the exception-like transition for hangups by the caller in Fig. 2.16, state 
Lproc is terminated whenever the caller hangs up. For hangups (returns) by the 
callee, the design of state Lproc results in an inconvenience: if the callee hangs up 
the phone first, the telephone will be dead (and quiet) until the caller has also hung 
up the phone. 

The StateCharts language includes a number of other language elements. For a 
full description, refer to Harel [203]. A more detailed description of the semantics 
of StateCharts is given by Drusinsky and Harel [141]. 


Fig. 2.19 Servicing the 
incoming line in Lproc 


Edge Labels and StateMate Semantics 


Until now, we have not considered outputs generated by our extended FSMs. 
Generated outputs can be specified using edge labels. The general form of an edge 
label is “event [condition]/reaction.” All three label parts are optional. The reaction 
part describes the reaction of the FSM to a state transition. Possible reactions include 
the generation of events and assignments to variables. The condition part implies 
a test of the values of variables or a test of the current state of the system. The 
event part refers to a test of current events. Events can be generated either internally 
or externally. Internal events are generated as a result of some transition and are 
described in reaction parts. External events are usually described in the model 
environment. 

Examples: 


• on-key/on:=1 (event test and variable assignment), 

• [on=1] (condition test for a variable value), 

• off-key [not in Lproc]/on:=0 (event test, condition test for a state, variable 
assignment; the assignment is performed if the event has occurred and the 
condition is true). 
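
The structure of edge labels can be made explicit by splitting a label into its three optional parts. The following Python sketch uses a regular expression that we assume to be adequate for the simple labels shown above; it is not a parser for the full StateCharts label syntax, and the function name is hypothetical.

```python
import re

# General form: event [condition]/reaction, with all three parts optional.
LABEL = re.compile(
    r"^\s*(?P<event>[^\[/]*?)\s*"
    r"(?:\[(?P<condition>[^\]]*)\])?\s*"
    r"(?:/(?P<reaction>.*))?$"
)


def parse_label(label):
    """Split an edge label into event, condition, and reaction (or None)."""
    match = LABEL.match(label)
    if match is None:
        raise ValueError(f"malformed edge label: {label!r}")
    parts = {}
    for name, text in match.groupdict().items():
        text = text.strip() if text is not None else ""
        parts[name] = text or None
    return parts


if __name__ == "__main__":
    for label in ("on-key/on:=1", "[on=1]", "off-key [not in Lproc]/on:=0"):
        print(parse_label(label))
```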


The semantics of edge labels can only be explained in the context of the 
semantics of StateMate [141], a commercial implementation of StateCharts. StateMate 
assumes a step-based execution of StateMate descriptions, as shown in Fig. 2.20. 

Fig. 2.20 Steps during the execution of a StateMate model 

Steps are assumed to be executed each time events or variables have changed. 
The set of all values of variables, together with the set of events generated (and 
the current time), is defined as the status¹³ of a StateMate model. After executing 
the third (final) phase of a step, a new status is obtained. 

The notion of steps allows us to define the semantics of events more precisely. 
The visibility of events is limited to the step following the one in which they 
are generated. Thus, events behave like single bit values which are stored in 
permanently enabled registers at one clock transition and have an effect on the 
values stored at the next clock transition. They do not live forever. 

Variables, in contrast, retain their values until they are reassigned. According to 
StateMate semantics, new values of variables are visible to all parts of the model 
from the step following the step in which the assignment was made onward. That 
means that StateMate semantics implies that new values of variables are propagated 
to all parts of a model between two steps. 

Each step consists of three phases: 


1. In the first phase, the impact of external changes on conditions and events is 
evaluated. This includes the evaluation of functions which depend on external 
events. This phase does not include any state changes. In our simple examples, 
this phase is not actually needed. 

2. The next phase is to calculate the set of transitions that should be made in the 
current step. Variable assignments are evaluated, but the new values are only 
assigned to temporary variables. 

3. In the third phase, state transitions become effective and variables obtain their 
new values. 


The separation into phases 2 and 3 is important in order to guarantee a 
reproducible behavior of StateMate models. 


¹³ We would normally use the term “state” instead of “status”. However, the term “state” has a 
different meaning in StateMate. 

Fig. 2.21 Mutually dependent assignments 

Fig. 2.22 Cross-coupled registers 


Example 2.14 Consider the StateMate model of Fig. 2.21. 

In the second phase, new values for a and b are stored in temporary variables, 
say a' and b'. In the final phase, these variables are copied into the user-defined 
variables: 


phase 2: a':=b; b':=a; 
phase 3: a:=a'; b:=b'; 


As a result, the values of the two variables will be swapped each time an event e 
happens. This behavior corresponds to that of two cross-coupled registers (one for 
each variable) connected to the same clock (see Fig. 2.22) and reflects the operation 
of a synchronous (clocked) finite state machine including those two registers.¹⁴ 

Without the separation into phases, the same value would be assigned to both 
variables. The result would depend on the sequence in which the assignments were 
performed. ∇ 
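
The effect of separating evaluation from assignment can be reproduced with a few lines of Python. This is a minimal sketch and not StateMate itself: the function names are hypothetical, and assignments are restricted to simple copies between variables.

```python
def step_two_phase(variables, assignments):
    """Phase 2: evaluate all right-hand sides on the old values.
    Phase 3: commit the temporary values to the variables."""
    temp = {target: variables[source] for target, source in assignments}
    new = dict(variables)
    new.update(temp)
    return new


def step_sequential(variables, assignments):
    """Without phases: each assignment sees the effects of earlier ones."""
    new = dict(variables)
    for target, source in assignments:
        new[target] = new[source]
    return new


swap = [("a", "b"), ("b", "a")]   # a := b; b := a

print(step_two_phase({"a": 1, "b": 2}, swap))         # {'a': 2, 'b': 1}
print(step_sequential({"a": 1, "b": 2}, swap))        # {'a': 2, 'b': 2}
print(step_sequential({"a": 1, "b": 2}, swap[::-1]))  # {'a': 1, 'b': 1}
```

The two-phase version swaps the values regardless of the order of the assignments, whereas the sequential version yields a result that depends on this order.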


The separation into (at least) two phases is quite typical for languages that try to 
reflect the operation of synchronous hardware. We will find the same separation in 
VHDL (see p. 107). Due to the separation, the results do not depend on the order in 
which parts of the model are executed by the simulation. This property is extremely 
important. Otherwise, there could be simulation runs generating different results, 
all of which would be considered correct. This is not what we expect from the 
simulation of a real circuit with a fixed behavior, and it could be very confusing 
in design procedures. There are different names for this property: 


• Kahn [278] calls this property determinate. 

• In other papers, this property is called deterministic. However, the term 
“deterministic” is employed with different meanings: 

  - It is used in the context of deterministic finite state machines, FSMs which 
    can only be in one state at a time. In contrast, non-deterministic finite state 
    machines can be in several states at the same time [221]. 

  - Languages may have non-deterministic operators. For these operators, 
    different behaviors are legal implementations. Approximate, non-deterministic 
    computations would be a relevant special case of non-deterministic operators. 

  - Many authors consider systems to be non-deterministic if their behavior 
    depends on some input not known before run-time. 

  - The term “deterministic” has also been used in the sense of “determinate,” as 
    introduced by Kahn. 


¹⁴ We adopt IEEE standard schematic symbols [238] for gates and registers for all the schematics 
in this book. The symbols in Fig. 2.22 denote clocked D-type registers. 

Fig. 2.23 Left, conflict between different nesting levels; right, conflict at the same 
nesting level 


In this book, we prefer to reduce possible confusion by following Kahn.¹⁵ Note 
that StateMate models can be determinate only if there are no other reasons for an 
undefined behavior. For example, conflicts between transitions may be allowed (see 
Fig. 2.23). 

Consider Fig. 2.23 (left). If event A takes place while the system is in the left 
state, we must figure out which transition will take place. If these conflicts were 
resolved arbitrarily, then we would have a non-determinate behavior. Typically, 
priorities are defined such that this type of a conflict is eliminated. Now, consider 
Fig. 2.23 (right). There will be a conflict for, e.g., x = 15. Such conflicts are difficult 
to detect. Achieving a determinate behavior requires the absence of conflicts that are 
resolved in an arbitrary manner. 
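
For guards over a small finite domain, conflicts at the same nesting level can be found by brute force. The following Python sketch enumerates all values for which more than one guard is true. The edge names and the restriction to integers between 0 and 30 are assumptions of this example; guards over large or continuous domains would require a different kind of analysis.

```python
def conflicting_values(guards, domain):
    """Return the values in domain for which more than one guard is true."""
    return [x for x in domain if sum(1 for g in guards.values() if g(x)) > 1]


# Guards of the two edges leaving the same state in Fig. 2.23 (right).
guards = {
    "edge_1": lambda x: x < 20,
    "edge_2": lambda x: x > 10,
}

print(conflicting_values(guards, range(0, 31)))
# [11, 12, 13, 14, 15, 16, 17, 18, 19]
```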

Note that there may be cases in which we would like to describe non-determinate 
behavior (e.g., if we have a choice to read from two inputs). In such a case, we would 
typically like to explicitly indicate that this choice can be taken at run-time (see the 
select statement of Ada on p. 112). 

Implementations of hierarchical state charts other than StateMate typically do 
not exhibit determinate behavior. These implementations correspond to a software- 
oriented view onto hierarchical state charts. In such implementations, choices are 
usually not explicitly described. 


Evaluation and Extensions 


StateMate implicitly assumes a broadcast mechanism for updates on variables. 
Hence, StateCharts or StateMate can be implemented easily for shared-memory-based 
platforms but are less appropriate for message passing and distributed 
systems. These languages essentially assume shared-memory-based communication, 
even though this is not explicitly stated. For distributed systems, it will be very 
difficult to update all variables between two steps. Due to this broadcast mechanism, 
StateMate is not an appropriate language for modeling distributed systems. 

¹⁵ In the first edition of the book, we used the term “deterministic” together with an additional 
explanation. 

Hence, StateCharts’ main application domain is that of local, control-dominated 
systems. The capability of nesting hierarchies at arbitrary levels, with a free choice 
of AND- and OR-superstates, is a key advantage of StateCharts. Another advantage 
is that the semantics of StateMate is defined at a sufficient level of detail [141]. 
Furthermore, there are quite a number of commercial tools based on StateCharts. 
StateMate [229] and StateFlow [382] are examples of commercial tools based on 
StateCharts. Many of them are capable of translating StateCharts into equivalent 
descriptions in C or VHDL (see p. 98). From VHDL, hardware can be generated 
using synthesis tools. Therefore, StateCharts-based tools provide a complete path 
from StateCharts-based specifications down to hardware. Generated C programs 
can be compiled and executed. Hence, a path to software-based realizations exists 
as well. 

Unfortunately, the efficiency of the automatic translation is sometimes a concern. 
For example, we could map substates of AND-superstates to processes at the 
operating system level. This would hardly lead to efficient implementations on small 
processors. The productivity gain from object-oriented programming is not available 
in StateCharts, since it is not object-oriented. StateCharts does not comprise program 
constructs for describing complex computation and cannot describe hardware 
structures or non-functional behavior. StateCharts allows timeouts, but there is no 
straightforward way of specifying other timing requirements. 

Commercial implementations of StateCharts typically provide some mechanisms 
for removing the limitations of the model. For example, C code can be used 
to represent program constructs, and module charts of StateMate can represent 
hardware structures. 

UML includes a variation of StateCharts and hence allows modeling state 
machines. In UML, these diagrams are called state diagrams in version 1 of UML 
and state machine diagrams from version 2.0 onward. Unfortunately, the semantics 
of state machine diagrams in UML is different from StateMate: the three simulation 
phases are not included. 


2.4.3 Synchronous Languages 


Motivation 


Describing complex SUDs in terms of state machine diagrams is difficult. Such dia- 
grams cannot express complex computations. Standard programming languages can 
express complex computations, but the sequence of executing several threads may 
be unpredictable. In a multithreaded environment with preemptive scheduling, there 
can be many different interleavings of the different computations. Understanding 
all possible behaviors of such concurrent systems is difficult. A key reason for this 
is that, in general, many different execution orders are feasible, i.e., the execution 
order is not specified. The order of execution may well affect the result. The 
resulting non-determinate behavior can have a number of negative consequences, 
such as problems with verifying a certain design. For distributed systems with 
independent clocks, determinate behavior is difficult to achieve. However, for 
non-distributed systems, we can try to avoid the problems of unnecessary 
non-determinate semantics. 

For synchronous languages, finite state machines and programming languages 
are merged into one model. Synchronous languages can express complex computa- 
tions, but the underlying execution model is that of finite automata. They describe 
concurrently operating automata. Determinate behavior is achieved by the following 
key feature: “... when automata are composed in parallel, a transition of the product 
is made of the “simultaneous” transitions of all of them” [197]. This means we do 
not have to consider all the different sequences of state changes of the automata that 
would be possible if each of them had its own clock. Instead, we can assume the 
presence of a single global clock. In each clock tick, all inputs are considered, new 
outputs and states are calculated, and then the transitions are made. This requires 
a fast broadcast mechanism for all parts of the model. This idealistic view of 
concurrency has the advantage of guaranteeing determinate behavior. This is a 
restriction if compared to the general communicating finite state machines (CFSM) 
model, in which each FSM can have its own clock. Synchronous languages reflect 
the principles of operation in synchronous hardware and also the semantics found in 
control languages such as IEC 60848 [231] and STEP 7 [488]. See Potop-Butucaru 
et al. [458] for a survey on synchronous languages. 


Examples of Synchronous Languages: Esterel, Lustre, and SCADE 


Guaranteeing a determinate behavior for all language features has been a design goal 
for the synchronous languages Esterel [61, 154], Lustre [199], and Quartz [480]. 

Esterel is a reactive language: when activated with an input event, Esterel models 
react by producing an output event. Esterel is a synchronous language: all reactions 
are assumed to be completed in zero time, and it is sufficient to analyze the behavior 
at discrete moments in time. This idealized model avoids all discussions about 
overlapping time ranges and about events that arrive while the previous reaction 
has not been completed. Like other concurrent languages, Esterel has a parallelism 
operator, written ||. Similar to StateCharts, communication is based on a broadcast 
mechanism. In contrast to StateCharts, however, communication is instantaneous. 
Instantaneous in this context means “within the same clock cycle.” This means that 
all signals generated in a particular clock cycle are also seen by the other parts of 
the model in the same clock cycle, and these other parts, if sensitive to the generated 
signals, react in the same clock cycle. Several rounds of evaluations may be required 
until a stable state is reached. Boldt et al. [56], for example, compute the resulting
worst-case reaction times. The propagation of values
during the same macroscopic instant of time corresponds to the generation of a 
next status for the same moment in time in StateMate, except that the broadcast 
is now instantaneous and not delayed until the next round of evaluations like in 
StateMate. For more and updated information about Esterel, refer to the Esterel 
home page [154].

Esterel and Lustre use different syntactic techniques to denote CFSMs. Esterel
appears as a kind of imperative language, whereas Lustre looks more like a data- 
flow language (see p. 68 for a description of data flow). SyncCharts is a graphical 
version of Esterel. In all three cases, semantics are explained by the closely related 
underlying CFSMs. The commercial graphical language SCADE [19] combines 
elements of all three languages. The so-called SCADE suite® is used for a number 
of safety-critical software components, for example, by Airbus. 

Due to the three simulation phases in StateMate, this tool has the key attributes 
of synchronous languages, and it is determinate if conflicts are resolved. According 
to Halbwachs, “StateMate is almost a synchronous language and the only feature 
missing in StateMate is the instantaneous broadcast” [198]. 
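
To make the notion of instantaneous broadcast more concrete, the following Python sketch computes the set of signals present in a single clock cycle by repeating rounds of evaluation until a stable state is reached. It is a strongly simplified illustration and not Esterel semantics: the rule format (a set of required signals and one emitted signal) and the function name `react` are assumptions made for this sketch, and reactions to the absence of signals are not modeled.

```python
def react(inputs, rules):
    """Return all signals present in one clock cycle and the number of rounds.

    rules is a list of (required_signals, emitted_signal) pairs
    (hypothetical format): the signal is emitted in the same cycle
    as soon as all required signals are present.
    """
    present = set(inputs)
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        for required, emitted in rules:
            if required <= present and emitted not in present:
                present.add(emitted)      # visible to all parts in this cycle
                changed = True
    return present, rounds


rules = [
    (frozenset({"B"}), "C"),
    (frozenset({"A"}), "B"),
    (frozenset({"A", "C"}), "D"),
]
present, rounds = react({"A"}, rules)
print(sorted(present), rounds)   # prints ['A', 'B', 'C', 'D'] 3
```

In this sketch, the third round only confirms that nothing changes any more; the number of rounds needed is what a worst-case reaction time analysis would have to bound.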


2.4.4 Message Passing: SDL as an Example


Features of the Language


StateCharts is not appropriate for modeling distributed communicating finite state 
machines. For distributed systems, message passing is the better communication 
paradigm. Therefore, we present a case of communicating finite state machines with 
asynchronous message passing. 

We use SDL (specification and description language) as an example. SDL was 
designed for distributed applications. It dates back to the 1970s. Formal semantics 
have been available since the 1980s. The language was standardized by the ITU 
(International Telecommunication Union). The first standards document is the Z.100
Recommendation published in 1980 with updates, for example, in 1992, 1999, 2011,
and 2016 [482]. The update of 1999 is known as SDL-2000. 

Many users prefer graphical specification languages, while others prefer textual 
ones. SDL pleases both types of users since it provides textual as well as graphical 
formats. Processes are the basic elements of SDL. Processes represent components 
modeled as extended finite state machines. Figure 2.24 shows the symbols used in 
the graphical representation of SDL. 


Example 2.15 Let us consider the FSM of Fig. 2.25. It is similar to that of
Fig. 2.13, except that output has been added, state Z
has been deleted, and the effect of signal k has been changed.

Figure 2.26 contains the corresponding graphical SDL representation. 

The representation in Fig. 2.26 is equivalent to the state diagram of Fig. 2.25. ∇


Fig. 2.24 Symbols used in the graphical form of SDL (initial state, state, input, output)

Fig. 2.25 FSM to be described in SDL

As an extension to FSMs, SDL processes can perform operations on data.
Variables can be declared locally for processes. Their type can either be predefined 
or defined in the SDL description itself. SDL supports abstract data types (ADTs). 
The syntax for declarations and operations is similar to that in other languages. 
Figure 2.27 shows how declarations, assignments, and decisions can be represented 
in SDL. 

SDL also contains programming language elements such as procedures. Pro- 
cedure calls can also be represented graphically. Object-oriented features became 
available with version SDL-1992 of the language and were extended with SDL- 
2000. 

Extended FSMs are just the basic elements of SDL descriptions. In general, SDL 
descriptions will consist of a set of interacting processes, or FSMs. Processes can 
send signals to other processes. Semantics of interprocess communication in SDL 
is based on asynchronous message passing and conceptually implemented through 
first-in first-out (FIFO) queues associated with processes. There is exactly one 
input queue per process. Signals sent to a particular process will be placed into 
the corresponding FIFO queue (see Fig. 2.28).

Fig. 2.28 SDL interprocess

Fig. 2.29 Process interaction diagram

Fig. 2.30 Left, process name identifies recipient; right, channel identifies recipient


Each process is assumed to fetch the next available entry from the FIFO queue 
and check whether it matches one of the inputs described for the current state. If 
it does, the corresponding state transition takes place and output is generated. The 
entry from the FIFO queue is ignored if it does not match any of the listed inputs 
(unless the so-called SAVE mechanism is used). FIFO queues are conceptually 
thought of as being of infinite length. This means in the description of the semantics 
of SDL models, FIFO overflow is never taken into account. In actual systems, 
however, infinite FIFO queues cannot be implemented. They must be of finite 
length. This is one of the problems of SDL: in order to derive realizations from 
specifications, safe upper bounds on the length of the FIFO queues must be proven. 
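
The queue semantics described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class `SdlProcess`, its transition table format, and the states and signals used in the usage example are hypothetical and do not reproduce any of the figures. The SAVE mechanism is not modeled, and the queue is unbounded, as in the conceptual SDL model.

```python
from collections import deque


class SdlProcess:
    """Extended FSM with exactly one, conceptually unbounded, FIFO input queue."""

    def __init__(self, initial_state, transitions):
        # transitions: {(state, signal): (next_state, output_signal or None)}
        self.state = initial_state
        self.transitions = transitions
        self.queue = deque()
        self.outputs = []
        self.max_queue_length = 0     # observed length, useful for sizing real FIFOs

    def receive(self, signal):
        """Asynchronous send: the sender never blocks."""
        self.queue.append(signal)
        self.max_queue_length = max(self.max_queue_length, len(self.queue))

    def step(self):
        """Fetch one entry; take the matching transition or ignore the entry."""
        if not self.queue:
            return False
        signal = self.queue.popleft()
        key = (self.state, signal)
        if key in self.transitions:
            self.state, output = self.transitions[key]
            if output is not None:
                self.outputs.append(output)
        # An entry without a matching input in the current state is discarded.
        return True


# Hypothetical two-state process.
p = SdlProcess("A", {("A", "g"): ("B", "w"), ("B", "h"): ("A", None)})
for s in ["h", "g", "h"]:
    p.receive(s)
while p.step():
    pass
print(p.state, p.outputs, p.max_queue_length)   # prints A ['w'] 3
```

The first signal h is discarded because state A has no input h. The recorded maximum queue length is only an observation for one run; it is not a proof of a safe upper bound.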

Process interaction diagrams can be used for visualizing which of the processes 
are communicating with each other. Process interaction diagrams include channels 
used for sending and receiving signals. In the case of SDL, the term “signal” denotes 
inputs and outputs of modeled automata. 


Example 2.16 Figure 2.29 shows a process interaction diagram B1 with channels 
Sw1 and Sw2. Brackets include the names of signals propagated along a certain 
channel. ∇


There are three ways of indicating the recipient of signals: 


1. Through process identifiers: By using identifiers of recipient processes in the 
graphical output symbol (see Fig. 2.30 (left)).
The number of processes does not need to be fixed at compile time, since
processes can be generated dynamically at run-time. OFFSPRING represents 
identifiers of child processes generated dynamically by a process. 

2. Explicitly: By indicating the channel name (see Fig. 2.30 (right)). Sw1 is the 
name of a channel. 

3. Implicitly: If signal names imply the channel names, those channels are used. 
Example: For Fig. 2.29, signal B will implicitly always be communicated via 
channel Sw1. 


No process can be defined within any other (processes cannot be nested). 
However, they can be grouped hierarchically into so-called blocks. Blocks at the 
highest hierarchy level are called systems. A system will not have any channels 
at its boundary if the environment is also modeled as a block. Process interaction 
diagrams are special cases of block diagrams. Process interaction diagrams are one 
level above the leaves of the hierarchical description. 


Example 2.17 Block B1 of Example 2.16 can be used within intermediate level 
blocks (such as within B in Fig. 2.31). 

At the highest level in the hierarchy, we have the system (see Fig. 2.32). 
Figure 2.33 shows the hierarchy modeled by the block diagrams in Figs. 2.29, 2.30, 
2.31, and 2.32. 

This example demonstrates that process interaction diagrams are next to the 
leaves of the hierarchical description, while system descriptions represent their root. 


∇

Fig. 2.31 SDL block (block B)

Fig. 2.32 SDL system (system S)

Fig. 2.33 SDL hierarchy

Fig. 2.34 Using timer Z


Some of the restrictions of modeling hierarchy are removed in version SDL-2000 
of the language. With SDL-2000, the descriptive power of blocks and processes is 
harmonized and replaced by a general agent concept. 

In order to support the modeling of time, SDL includes timers. Timers can 
be declared locally for processes. They can be set using the SET primitive. This 
primitive has two parameters: an absolute time and a timer name. The absolute time 
defines a time at which the timer elapses. The built-in function now can be used to 
refer to the time at which the SET primitive is executed. Once a timer has elapsed,
a signal is stored in the input queue. The name of this signal is obtained from the 
second parameter of the SET call. The signal will then typically cause a certain 
transition to take place in the FSM. However, this transition may be delayed by other 
entries in the input queue which have to be processed first. Hence, this timer concept 
is designed for soft timing constraints typically found in telecommunications and 
inappropriate for hard timing constraints. A second built-in function expirytime can 
be used to avoid some of the limitations of the now function. 

Timers can be reset using the RESET primitive. This primitive will stop the 
counting process and—in case the signal has already been stored in the input 
queue—removes the signal from it. An implicit RESET is executed at the very 
beginning of executing a SET. 
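
The interplay of SET, RESET, and the input queue can be sketched in Python as follows. The class `TimedProcess` and its method names are hypothetical and only mirror the behavior described above: SET takes an absolute time and a timer name, it starts with an implicit RESET, and an elapsed timer merely appends a signal to the input queue.

```python
from collections import deque


class TimedProcess:
    """Input queue plus SDL-like timers (simplified sketch)."""

    def __init__(self):
        self.queue = deque()
        self.timers = {}              # timer name -> absolute expiry time

    def set_timer(self, expiry_time, name):
        self.reset_timer(name)        # implicit RESET at the beginning of SET
        self.timers[name] = expiry_time

    def reset_timer(self, name):
        self.timers.pop(name, None)   # stop the counting process
        # Remove an already queued timer signal, if there is one.
        self.queue = deque(s for s in self.queue if s != name)

    def advance_to(self, now):
        """Store the signals of all elapsed timers in the input queue."""
        for name, expiry in sorted(self.timers.items(), key=lambda kv: kv[1]):
            if expiry <= now:
                self.queue.append(name)
                del self.timers[name]


p = TimedProcess()
p.queue.append("f")                   # an earlier signal is still waiting
now = 10
p.set_timer(now + 5, "Z")
p.advance_to(15)
print(list(p.queue))                  # prints ['f', 'Z']
```

The timer signal Z is queued behind f, so the corresponding transition is delayed until f has been processed. This is why the concept fits soft timing constraints only.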


Example 2.18 Figure 2.34 shows the use of a timer Z. The diagram corresponds to 
that of Fig. 2.26, with the exception that timer Z is set to the current time now plus 
T during the transition from state D to E. For the transition from E to A, we now 
have a timeout of T time units. If these time units have elapsed before signal f has 
arrived, a transition to state A is taken without generating output signal v. Strictly 
periodic processing with a period of T is difficult to achieve this way, due to the 
possible delays by other entries in the input queue. ∇

Fig. 2.36 Protocol stacks represented in SDL


Example 2.19 SDL is well suited to describing protocol stacks found in computer
networks. Figure 2.35 shows three processors
connected through a router. Communication between processors and the router is 
based on FIFOs. The processors as well as the router implement layered protocols 
(see Fig. 2.36). Each layer describes communication at a more abstract level. 
The behavior of each layer is typically modeled as a finite state machine. The 
detailed description of these FSMs depends on the network protocol and can be 
quite complex. Typically, this behavior includes checking and handling of error 
conditions, as well as sorting and forwarding of information packets. ∇


Available tools for SDL include interfaces to UML (see p. 120) and SCs (see 
p. 43). A comprehensive list of tools is available from the SDL forum [483]. 

Estelle [74] is another language which was designed to describe communication 
protocols. Similar to SDL, Estelle assumes communication via channels and FIFO 
buffers. Attempts to unify Estelle and SDL failed. 


Evaluation of SDL 


SDL is excellent for distributed applications, and it is very useful as a reference 
model for asynchronous message passing. SDL is not necessarily determinate (the
order in which signals arriving at some FIFO at the same time are processed is not
specified). Reliable implementations require the knowledge of an upper bound on
the length of the FIFOs. This upper bound may be difficult to compute. The timer
concept is sufficient for soft deadlines, but not for hard ones. Hierarchies are not
supported in the same way as in StateCharts. There is no full programming support
(but revisions of the standard changed this) and no description of non-functional
properties. SDL has been used, for example, for specifying ISDN. Currently, there 
seems to be a trend for a generalization of discussions, from the specific case of 
SDL toward general system description techniques [482]. 


2.5 Data Flow 


2.5.1 Scope 


Data flow is a very “natural” way of describing real-life applications. Data-flow 
models reflect the way in which data flows from component to component [146]. 
Each component transforms the data in one way or the other. The following is a 
possible definition of data flow: 


Definition 2.14 ([582]) Data-flow modeling “is the process of identifying, model- 
ing, and documenting how data moves around an information system. Data flow 
modeling examines processes (activities that transform data from one form to 
another), data stores (the holding areas for data), external entities (what sends data 
into a system or receives data from a system), and data flows (routes by which data 
can flow).” 


A data flow program is specified by a directed graph where the nodes (vertices), 
called actors, represent computations and the arcs represent communication chan- 
nels. The computation performed by each actor is assumed to be functional, that is, 
based on the input values only. Each process in a data flow graph is decomposed into 
a sequence of firings, which are atomic actions. Each firing produces and consumes 
tokens. 


Example 2.20 Figure 2.37 describes the flow of data in a video-on-demand (VOD)
system [298]. Viewers enter the system via the network interface. Their
admission requests are added to the customer queue. Once viewers are admitted, their
requests are scheduled for the file system. The file system, in cooperation with
storage control, makes videos available to the customer. ∇


For unrestricted data flow, it is difficult to prove requested system properties. 
Therefore, restricted models are commonly used. 

A special type of data flow is used for implementing out-of-order execution of 
instructions in computer architectures. This type of execution is also known as 
dynamic scheduling of instructions. Two algorithms for dynamic scheduling are 
well-known: scoreboarding and the Tomasulo algorithm [544]. Both algorithms are 
covered in detail in books on computer architecture (see, e.g., Hennessy et al. [211]). 
Therefore, they are not included in this book. There are variants of these algorithms 
which are applied at task level (e.g., see Wang et al. [564]).

Fig. 2.37 Video-on-demand system (blue, storage; yellow, processing; green, I/O)


2.5.2 Kahn Process Networks 


Kahn process networks (KPN) [278] are a special case of data-flow models. Like 
other data-flow models, KPNs consist of nodes and edges. Nodes correspond to 
computations performed by some program or task. KPN graphs, like all data-flow 
graphs, show computations to be performed and their dependencies, but not the 
order in which the computations must be performed (in contrast to specifications 
in von Neumann languages such as C). Edges imply communication via channels 
containing potentially infinite FIFOs. Computation times and communication times 
may vary, but communication is guaranteed to happen within a finite amount of 
time. Writes are non-blocking, since the FIFOs are assumed to be as large as needed. 
Reads must specify a single channel to be read from. A node cannot check whether 
data is available before attempting a read. A process cannot wait for data on more 
than one port at a time. Read operations block whenever an attempt is made to read 
from an empty FIFO queue. Only a single process is allowed to read from a certain 
queue, and only a single process is allowed to write into a queue. So, if output 
data has to be sent to more than a single process, data must be duplicated inside 
processes. There is no other way for communication between processes except 
through FIFO queues. In the following example, t1 decrements and t2 increments
the value received from its partner:


process t1(in int u, out int v){
  int i;
  i = 0;
  for (;;) {
    send(i,v);     /* send i via channel v */
    i = wait(u);   /* read i from channel u */
    i = i-1;
  }
}
process t2(in int v, out int u){
  int i;
  for (;;) {
    i = wait(v);   /* read i from channel v */
    i = i+1;
    send(i,u);     /* send i via channel u */
  }
}


Figure 2.38 shows a graphical representation of this KPN.

Fig. 2.38 Graphical representation of KPN (t1 and t2 connected by FIFO channels u and v)

Obviously, we do not really need the FIFOs in this example, since messages 
cannot accumulate in the channels. This example and other examples can be 
simulated with the levi simulation software [496]. 
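
For readers without access to a dedicated simulator, the two processes can also be approximated with Python threads. This is a sketch under stated assumptions, not a KPN tool: the channels are modeled with the standard library class `queue.Queue` (unbounded by default, with a blocking `get`), and the endless loops are limited to a hypothetical number of `rounds` so that the program terminates. The function `run_kpn` and the list `log` exist only for this illustration.

```python
import queue
import threading


def t1(u, v, rounds, log):
    i = 0
    for _ in range(rounds):
        v.put(i)          # write never blocks: the FIFO is unbounded
        i = u.get()       # blocking read from exactly one channel
        log.append(i)     # record the value received from t2
        i = i - 1


def t2(v, u, rounds):
    for _ in range(rounds):
        i = v.get()       # blocking read
        i = i + 1
        u.put(i)


def run_kpn(rounds=5):
    u, v = queue.Queue(), queue.Queue()
    log = []
    threads = [
        threading.Thread(target=t1, args=(u, v, rounds, log)),
        threading.Thread(target=t2, args=(v, u, rounds)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


print(run_kpn())          # prints [1, 1, 1, 1, 1]
```

Each channel has a single writer and a single reader, and every read names one channel. Therefore, the recorded values do not depend on how the operating system schedules the two threads.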

The restrictions on read and write operations mentioned above result in the key
beauty of KPNs: the order in which a node reads data from its channels is
fixed by the sequence of read operations and does not depend on the order in which 
producers are transmitting data over the channels. This means that the sequence of 
operations is independent of the speed of the nodes producing data. For a given set 
of input data, KPNs will always generate the same results, independently of the 
speed of the nodes. This property is important, for example, for simulations: it does 
not matter how fast we are simulating the KPN; the result will always be the same. 
In particular, the result does not depend on using hardware accelerators for some of 
the nodes, and a distributed execution will give the same result as a centralized one. 
This property has been called “determinate” and we are following this use. SDL-like 
conflicts at FIFOs do not exist. Due to this nice property, KPNs are frequently used 
as an internal representation within a design flow. 

Sometimes, KPNs are extended with a “merge” operator (corresponding to Ada’s 
select statement, see p. 112). This operator allows for queuing read commands 
containing a list of channels. The operator completes execution after the first of 
these channels has generated data. Such an operator introduces a non-determinate 
behavior: the order of processing inputs is not specified if both inputs arrive at
the same time. This extension is useful in practice, but it destroys the key beauty
of KPNs. 

In general, Kahn processes require scheduling at run-time, since it is difficult to 
predict their precise behavior over time. These problems result from the fact that we 
do not make any assumptions regarding the speed of the channels and the nodes. 
Nevertheless, execution times are actually unknown during early design phases, and 
therefore this model is very adequate. 

KPNs are Turing-complete, which means whatever can be computed by a Turing 
machine (the standard model for computability) can also be computed by a KPN. 
The proof is based on the fact that KPNs are a superset of so-called Boolean dataflow 
(BDF) and according to Buck [73] BDF can simulate Turing machines. However, 
the number of processes has to be fixed at design time, which is an important 
limitation for many applications. 

Whether or not finite-length FIFOs are sufficient for an actual KPN model is 
undecidable in the general case. However, useful scheduling algorithms [293] or 
proofs of the boundedness of the FIFOs [99] exist for some special cases. For 
example, bounds can be derived for polyhedral process networks (PPNs). For PPNs, 
the code for each of the nodes includes loops with bounds known at compile time. 
Derin [125] exploits knowledge about the code of the nodes for dynamic task 
migration. 


2.5.3 SDF 


Scheduling becomes significantly easier, and questions regarding buffer sizes can 
decidably be answered if we impose sufficient restrictions on the timing of nodes 
and channels. For SDF [328], this is the case. Initially, the acronym SDF was a 
shorthand for synchronous data flow. Today, it is increasingly interpreted to denote 
static data flow. 

SDF can be introduced by referring to its graphical notation. SDF models include 
a directed graph, i.e., SDF models contain nodes and directed edges. Nodes are also 
called actors. Edges can store tokens, by default an unlimited number of them. 
Some of the edges will initially contain some tokens. Each edge has an incoming 
and an outgoing weight. The execution of an SDF model assumes a clock. For an 
actor to be enabled, it is necessary that for each of the edges leading to that actor, 
the number of tokens on that edge is at least equal to the outgoing weight for the 
edge. 


Example 2.21 Figure 2.39 (left) shows an SDF graph. Actors A and B denote 
computations. Input edges like the one shown at the top for actor A are assumed 
to supply an infinite stream of tokens. Actor B is enabled since there is a sufficient 
number of tokens on the edges leading to B. Actor A is not enabled. At each clock 
tick, enabled actors can fire, but they do not have to. If they fire, the number of 
tokens on the incoming edges gets decreased by the incoming weight, and the
number of tokens on the outgoing edges is increased by the outgoing weight. In
our example, the resulting number of tokens is shown in Fig. 2.39 on the right.
Obviously, the number of tokens produced or consumed in a particular firing is
static (does not vary during the execution of the model). ∇


Fig. 2.39 Graphical representation of SDF: left, initial situation; right, after firing B

Fig. 2.40 Replacing explicit FIFO buffers by backward edges

Fig. 2.41 SDF loop (edge weights 3, 2, 4, and 6)


In practice, tokens will represent data, actors will represent computations, and 
edges should correspond to FIFO buffers. Buffers on the edges imply that SDF 
uses asynchronous message passing. Instead of using the default unlimited buffer 
capacities, we can express limited buffer capacities with backward edges. The initial 
number of tokens on these backward edges corresponds to the capacity of the FIFO 
buffer. This is shown in Fig. 2.40. The two models shown in Fig. 2.40 are equivalent. 
For example, the first firing of A will consume three tokens from the backward edge, 
leaving only one token on the backward edge, corresponding to the one empty FIFO 
slot after the first firing of A on the left. 

The property of producing and consuming a static number of tokens makes it 
possible to determine execution order and memory requirements at compile time. 
Hence, complex run-time scheduling of executions is avoided. SDF graphs can be 
translated into periodic schedules. 


Example 2.22 Let us have a closer look at schedules of SDF models. Consider the 
example shown in Fig. 2.41. Suppose that initially there are six tokens for edge e1. 
Then, Table 2.2 (left) shows the resulting schedule for firings. Due to the limited 
number of initial tokens, only sequential firings are feasible. Now, let us assume 
that there are nine initial tokens for edge e1. Assuming that all actors fire as soon
as possible, the schedule of Table 2.2 (right) is produced. Under this assumption, A
and B fire synchronously. ∇


Table 2.2 Schedules for loop in SDF: left, six initial tokens on e1; right, nine initial tokens on e1

        Six initial tokens on e1    | Nine initial tokens on e1
Clock | e1 | e2 | Next actor action | e1 | e2 | Next actor action
      |    |    | (A or B)          |    |    | (A or B or (A and B))
0     | 6  | 0  | A                 | 9  | 0  | A
1     | 3  | 2  | A                 | 6  | 2  | A
2     | 0  | 4  | B                 | 3  | 4  | A and B
3     | 6  | 0  | A                 | 6  | 2  | A
4     | 3  | 2  | A                 | 3  | 4  | A and B


Fig. 2.42 SDF delay


During the generation of schedules, we could also consider constraints and 
objectives such as a limited number of available processors [57]. 
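
The schedules of Table 2.2 can be reproduced with a small simulation. The following Python sketch assumes the token rates that are consistent with the table: A consumes three tokens from e1 and produces two on e2, while B consumes four tokens from e2 and produces six on e1. The function name `simulate_loop` is hypothetical, and every enabled actor is assumed to fire at each clock tick.

```python
def simulate_loop(initial_e1, ticks):
    """Fire every enabled actor at each clock tick (as soon as possible).

    Assumed rates: A takes 3 tokens from e1 and puts 2 on e2;
    B takes 4 tokens from e2 and puts 6 on e1.
    """
    e1, e2 = initial_e1, 0
    rows = []
    for clock in range(ticks):
        a_enabled = e1 >= 3
        b_enabled = e2 >= 4
        fired = [name for name, on in (("A", a_enabled), ("B", b_enabled)) if on]
        rows.append((clock, e1, e2, " and ".join(fired)))
        # Apply all firings of this tick together.
        if a_enabled:
            e1 -= 3
            e2 += 2
        if b_enabled:
            e2 -= 4
            e1 += 6
    return rows


for row in simulate_loop(9, 5):
    print(row)
```

With six initial tokens, the two actors never fire in the same tick. With nine initial tokens, A and B fire together in every second tick once the initial transient is over.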

In this example, using edge labels 2, 3, 4, and 6 resulted in different execution 
rates of actors A and B. In general, edge labels facilitate the modeling of multi-rate 
signal processing applications, applications for which certain signals are generated 
at frequencies that are multiples of other frequencies. For example, in a TV set, some 
computations might be performed at a rate of 100 Hz, while others are performed 
at a rate of 50 Hz. Ignoring some initial transient phase and considering longer 
periods, the number of tokens sent to an edge must be equal to the number of 
tokens consumed. Otherwise, tokens would accumulate in the FIFO buffers, and no 
finite FIFO capacity would be sufficient. Let $n_s$ be the number of tokens produced
by some sender per firing, and let $f_s$ be the corresponding rate. Let $n_r$ be the
corresponding number of tokens consumed per firing at the receiver, and let $f_r$ be
the corresponding rate. Then, we must have


$$n_s \cdot f_s = n_r \cdot f_r \tag{2.13}$$


This condition is met in the steady state for the example shown in Table 2.2. 
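
As a worked check of Eq. (2.13), the following Python sketch computes the smallest integer firing counts of sender and receiver that balance a single edge. The function name `firing_ratio` is hypothetical, and the rates used in the calls are those assumed for the loop of Table 2.2 (A produces two tokens on e2, of which B consumes four; B produces six tokens on e1, of which A consumes three).

```python
from math import gcd


def firing_ratio(produced_per_firing, consumed_per_firing):
    """Smallest firing counts (sender, receiver) with n_s * f_s == n_r * f_r."""
    g = gcd(produced_per_firing, consumed_per_firing)
    return consumed_per_firing // g, produced_per_firing // g


print(firing_ratio(2, 4))   # edge e2, sender A, receiver B: prints (2, 1)
print(firing_ratio(6, 3))   # edge e1, sender B, receiver A: prints (1, 2)
```

Both edges lead to the same result: A has to fire twice for each firing of B. This matches the steady state in both halves of the table.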

SDF graphs may include delays, denoted by the symbol D on an edge (see 
Fig. 2.42). 

The observer pattern, mentioned as a problem for modeling with von Neumann 
languages on p. 35, can be easily implemented correctly in SDF (see Fig. 2.43). 
There is no risk of deadlocks. However, SDF does not allow adding new observers 
at run-time. 

The letter S in SDF initially was meant to stand for the term synchronous, 
since enabled nodes fire synchronously. However, the two schedules in Table 2.2
demonstrate that cases of firing all actors synchronously may indeed be very rare.
Therefore, the “S” in SDF has also been reinterpreted to denote the term “static”
instead of “synchronous.”

Fig. 2.43 Observer pattern in SDF

SDF models are determinate [206], but they are not appropriate for modeling 
control flow, such as branches, etc. Several extensions and variations of SDF models 
have been proposed (see, e.g., Stuijk [515]): 


• For example, we can have modes corresponding to states of an associated finite
  state machine. For each of the modes, a different SDF graph could be relevant.
  Certain events could then cause transitions between these modes. We could have
  modes for different video resolutions and could have a transition whenever we
  change the resolution.

• Homogeneous synchronous data-flow (HSDF) graphs are a special case of SDF
  graphs. For HSDF graphs, the number of tokens consumed and produced per
  firing is always 1.

• For cyclo-static data flow (CSDF), the number of tokens produced and consumed
  per firing can vary over time but has to be periodic.


Complex SUDs including control flow must be modeled using more general 
computational graph structures. 


2.5.4 Simulink 


Computational graph structures are also frequently used in control engineering. For 
this domain, the Simulink® toolbox of MATLAB® [529, 533] is very popular. 
MATLAB is a modeling and simulation tool based on mathematical models 
including, for example, partial differential equations. Figure 2.44 shows an example 
of a Simulink model [365]. 

The amplifier Gain6 and the saturation component Saturation on the right demon- 
strate the inclusion of analog modeling. In the general case, the “schematic” could 
contain symbols denoting analog components such as integrators and differentiators. 
The switch in the center indicates that Simulink also allows some control flow 
modeling. 

The graphical representation is intuitive and allows control engineers to focus 
on the control function, without caring about the code necessary to implement the
function. The graphical symbols suggest that analog circuits are used as traditional
components in control designs. A key goal is to synthesize software from such
models. This approach is typically associated with the term model-based design.

Fig. 2.44 Simulink model

Semantics of Simulink models reflect the simulation on a digital computer, and 
the behavior may be similar to that of analog circuits, but possibly not quite the 
same. What is actually the semantics of a Simulink model? Marian and Ma [365] 
describe the semantics as follows: “Simulink uses an idealized timing model for 
block (node) execution and communication. Both happen infinitely fast at exact 
points in simulated time. Thereafter, simulated time is advanced by exact time steps. 
All values on edges are constant in between time steps.” This means that we execute 
the model time step after time step. For each step, we compute the function of 
the nodes (in zero time) and propagate the new values to connected inputs. This 
explanation does not specify the distance between time steps. Also, it does not 
immediately tell us how to implement the system in software, since even slowly 
varying outputs may be recomputed frequently. 
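
The quoted execution scheme can be illustrated with a few lines of Python. The sketch below is not generated from Fig. 2.44 and does not use any Simulink interface. It assumes a hypothetical model consisting of a constant source, a gain block, a saturation block, and a discrete integrator. The step size, left open by the explanation above, is an explicit parameter here.

```python
def saturate(value, lower, upper):
    """Saturation block: clip the value to the interval [lower, upper]."""
    return max(lower, min(upper, value))


def simulate(steps, step_size=0.1, gain=2.0, limit=1.0):
    """Fixed-step execution: evaluate all blocks, then advance simulated time."""
    integrator_state = 0.0            # output of the integrator block
    samples = []
    for k in range(steps):
        time = k * step_size          # exact point in simulated time
        # All blocks are evaluated "in zero time" at this point.
        amplified = gain * 1.0        # constant source 1.0 feeding the gain block
        limited = saturate(amplified, -limit, limit)
        samples.append((time, integrator_state))
        # Update the state (forward Euler); edge values stay constant until the next step.
        integrator_state += step_size * limited
    return samples


for time, value in simulate(3):
    print(time, value)
```

Halving `step_size` doubles the number of block evaluations for the same simulated interval, which illustrates why the choice of the time step matters for a software implementation.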

This approach is appropriate for modeling physical systems such as cars or trains 
at a high level and then simulating the behavior of these systems. Also, digital signal 
processing systems can be conveniently modeled with MATLAB® and Simulink®. 
In order to generate implementations, MATLAB/Simulink models first must be 
translated into a language supported by software or hardware design systems, such 
as C or VHDL. This way of generating software can be considered a case of model- 
based design. Model-based design could be a way of avoiding time-consuming 
manual code generation, but this requires that the issues mentioned above do not 
block the applicability of this approach. 

Components in Simulink models provide a special case of actors. We can assume 
that actors are waiting for input and perform their operation once all required inputs
have arrived. SDF and KPNs are other cases of actor-based languages. In actor-based
languages, there is no need to pass control to these actors, like in von
Neumann languages. This has the advantage of providing freedom for scheduling 
of computations in software. 


2.6 Petri Nets 


2.6.1 Introduction 


Very comprehensive descriptions of control flow are feasible with computational 
graphs known as Petri nets. Actually, Petri nets model only control and control 
dependencies. Modeling data as well requires extensions of Petri nets. Petri nets 
focus on the modeling of causal dependencies. 

In 1962, Carl Adam Petri published his method for modeling causal depen- 
dencies, which became known as Petri nets [450]. Petri nets do not assume any 
global synchronization and are therefore especially suited for modeling distributed 
systems. 

Conditions, events, and a flow relation are the key elements of Petri nets. 
Conditions are either satisfied or not satisfied. Events can happen. The flow relation 
describes the conditions that must be met before events can happen, and it also 
describes the conditions that become true if events happen. Graphical notations for 
Petri nets typically use circles to denote conditions and boxes to denote events. 
Arrows represent flow relations. 


Example 2.23 Our first example, shown in Fig. 2.45, describes mutual exclusion 
for trains on a railroad track that must be used in both directions. A token is used 
to prevent collisions of trains going into opposite directions. In the Petri net, that 
token is symbolized by a condition in the center of the model. A partially filled 
circle (a circle containing a second, filled circle) denotes that a condition is met (this 
means the track is available). When a train wants to travel to the right (also denoted
by a partially filled circle in Fig. 2.45), the two conditions that are necessary for
the event “train entering track from the left” are met. We call these two conditions
preconditions. If the preconditions of an event are met, it can happen.

Fig. 2.45 Single-track railroad segment (conditions: train wanting to go right, train going to the right, track available, train going to the left; events: train entering track from the left, train leaving track to the right)

Fig. 2.46 Using resource “track”

Fig. 2.47 Freeing resource “track”

As a result of that event happening, the token is no longer available, and there is 
no train waiting to enter the track. Hence, the preconditions are no longer met, and 
the partially filled circles disappear (see Fig. 2.46). 

However, there is now a train going on that track from the left to the right, and 
thus the corresponding condition is met (see Fig. 2.46). A condition which is met 
after an event happened is called a postcondition. In general, an event can happen 
only if all its preconditions are true (or met). If it happens, the preconditions are no 
longer met, and the postconditions become valid. Arrows identify those conditions 
which are preconditions of an event and those that are postconditions of an event. 
Continuing with our example, we see that a train leaving the track will return the 
token to the condition at the center of the model (see Fig. 2.47). 

Now, consider two trains competing for the single-track segment (see Fig. 2.48). 

Only one train can enter. In such situations, the next transition to be fired is 
non-deterministically chosen. Analyses of the net must consider all possible firing 
sequences. For Petri nets, we are intentionally modeling non-determinism. ∇
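The token game of Example 2.23 can be sketched in a few lines of Python. This is only an illustration under assumptions: the condition and event names are invented here, and the net structure (each entering event consumes the "track available" token, each leaving event returns it) is inferred from the description above rather than copied from the figures.

```python
# Condition/event net for the single-track segment (all names are illustrative).
# Each event maps to (preconditions, postconditions).
EVENTS = {
    "enter_from_left":  ({"wants_right", "track_available"}, {"going_right"}),
    "leave_to_right":   ({"going_right"}, {"track_available"}),
    "enter_from_right": ({"wants_left", "track_available"}, {"going_left"}),
    "leave_to_left":    ({"going_left"}, {"track_available"}),
}

def enabled(marking):
    """Return (sorted) all events whose preconditions are met."""
    return sorted(e for e, (pre, _) in EVENTS.items() if pre <= marking)

def fire(marking, event):
    """Fire an enabled event: its preconditions cease to hold, its postconditions hold."""
    pre, post = EVENTS[event]
    if not pre <= marking:
        raise ValueError(f"{event} is not enabled")
    return (marking - pre) | post

# Two trains compete for the track (the conflict situation).
m0 = frozenset({"wants_right", "wants_left", "track_available"})
m1 = fire(m0, "enter_from_left")      # one of the two possible choices
m2 = fire(m1, "leave_to_right")       # the token returns to the center
print(enabled(m0), enabled(m1), enabled(m2))
```

In marking `m0` both entering events are enabled, which is exactly the non-deterministic choice mentioned above; an analysis would have to follow both branches.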


A key advantage of Petri nets is that they can be the basis for formal proofs about
system properties and that there are standardized ways of generating such proofs. In
order to enable such proofs, we need a more formal definition of Petri nets. We will
consider three classes of Petri nets: condition/event nets, place/transition nets, and
predicate/transition nets.

Fig. 2.48 Conflict for resource “track”


2.6.2 Condition/Event Nets 


Condition/event nets are the first class of Petri nets that we will define more 
formally. 


Definition 2.15 $N = (C, E, F)$ is called a net iff the following holds:


1. $C$ and $E$ are disjoint sets.
2. $F \subseteq (E \times C) \cup (C \times E)$ is a binary relation, called flow relation.


The set $C$ is called conditions and the set $E$ is called events.


Definition 2.16 Let $N$ be a net and let $x \in (C \cup E)$. ${}^\bullet x := \{y \mid y F x,\ y \in (C \cup E)\}$
is called the pre-set of $x$. If $x$ denotes an event, ${}^\bullet x$ is also called the set of
preconditions of $x$.


Definition 2.17 Let $N$ be a net and let $x \in (C \cup E)$. $x^\bullet := \{y \mid x F y,\ y \in (C \cup E)\}$
is called the post-set of $x$. If $x$ denotes an event, $x^\bullet$ is also called the set of
postconditions of $x$.


The terms preconditions and postconditions are preferred if these sets actually
denote conditions (elements of $C$), that is, if $x \in E$.
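Definitions 2.16 and 2.17 translate directly into set comprehensions. The following minimal sketch assumes that the flow relation is stored as a Python set of (source, target) pairs; the tiny net and its element names are made up for illustration.

```python
# A small illustrative net: two conditions lead into event e1, which leads to c3.
C = {"c1", "c2", "c3"}
E = {"e1"}
F = {("c1", "e1"), ("c2", "e1"), ("e1", "c3")}

def pre_set(x, flow):
    """Pre-set of x: all y with y F x."""
    return {y for (y, z) in flow if z == x}

def post_set(x, flow):
    """Post-set of x: all y with x F y."""
    return {z for (y, z) in flow if y == x}

print(pre_set("e1", F), post_set("e1", F))
```

For the event `e1`, the pre-set is its set of preconditions and the post-set is its set of postconditions.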


Definition 2.18 Let $(c, e) \in C \times E$. $(c, e)$ is called a loop if $c F e \land e F c$.

Definition 2.19 Let $(c, e) \in C \times E$. $N$ is called pure if $F$ does not contain any
loops (see Fig. 2.49 (left)).


Definition 2.20 A net is called simple if no two transitions $t_1$ and $t_2$ have the same
set of pre- and postconditions (see Fig. 2.49 (center) and (right)).

Fig. 2.49 Nets which are not pure (left) and not simple (center and right)
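Both properties can be checked mechanically. The sketch below follows Definitions 2.18 to 2.20 as stated here (simplicity is checked for events only); the representation of the flow relation as a set of pairs and the two example nets are assumptions made for illustration.

```python
def pre_set(x, flow):
    return {y for (y, z) in flow if z == x}

def post_set(x, flow):
    return {z for (y, z) in flow if y == x}

def is_pure(conditions, events, flow):
    """True if no pair (c, e) has both c F e and e F c (no loops)."""
    return not any((c, e) in flow and (e, c) in flow
                   for c in conditions for e in events)

def is_simple(events, flow):
    """True if no two events share both their pre-set and their post-set."""
    signatures = [(frozenset(pre_set(e, flow)), frozenset(post_set(e, flow)))
                  for e in events]
    return len(signatures) == len(set(signatures))

# A net with a loop, and a net with two events that cannot be told apart.
loop_net = {("c1", "e1"), ("e1", "c1")}
twin_net = {("c1", "e1"), ("e1", "c2"), ("c1", "e2"), ("e2", "c2")}
print(is_pure({"c1"}, {"e1"}, loop_net), is_simple({"e1", "e2"}, twin_net))
```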


Simple nets with no isolated elements meeting some additional restrictions are 
called condition/event nets. Condition/event nets are a special case of bipartite 
graphs (graphs with two disjoint sets of nodes). We will not discuss those additional 
restrictions in detail since we will consider more general classes of nets in the 
following. 


2.6.3 Place/Transition Nets 


For condition/event nets, there is at most one token per condition. For many applica- 
tions, it is useful to remove this restriction and to allow more tokens per condition. 
Nets allowing more than one token per condition are called place/transition nets. 
Places correspond to what we so far called conditions, and transitions correspond to 
what we so far called events. The number of tokens per place is called a marking. 
Mathematically, a marking is a mapping from the set of places to the set of natural
numbers extended by a special symbol $\omega$ denoting infinity.

Let $\mathbb{N}_0$ denote the natural numbers including 0. Then, formally speaking,
place/transition nets can be defined as follows:


Definition 2.21 $(P, T, F, K, W, M_0)$ is called a place/transition net $\iff$


1. $N = (P, T, F)$ is a net with places $p \in P$, transitions $t \in T$, and flow relation $F$.

2. Mapping $K : P \to (\mathbb{N}_0 \cup \{\omega\}) \setminus \{0\}$ denotes the capacity of places ($\omega$ symbolizes
infinite capacity).

3. Mapping $W : F \to (\mathbb{N}_0 \setminus \{0\})$ denotes the weight of graph edges.

4. Mapping $M_0 : P \to \mathbb{N}_0 \cup \{\omega\}$ represents the initial marking of places.


Edge weights affect the number of tokens that are required before transitions 
can happen and also identify the number of tokens that are generated if a certain 
transition takes place. Let $M(p)$ denote a current marking of place $p \in P$, and
let $M'(p)$ denote a marking after some transition $t \in T$ took place. The weight of
edges belonging to preconditions represents the number of tokens that are removed
from places in the pre-set. Accordingly, the weight of edges belonging to the
postconditions represents the number of tokens that are added to the places in the
post-set. Formally, marking $M'$ is computed as follows:

\[
M'(p) = \begin{cases}
M(p) - W(p,t), & \text{if } p \in {}^\bullet t \setminus t^\bullet \\
M(p) + W(t,p), & \text{if } p \in t^\bullet \setminus {}^\bullet t \\
M(p) - W(p,t) + W(t,p), & \text{if } p \in {}^\bullet t \cap t^\bullet \\
M(p) & \text{otherwise}
\end{cases}
\]

Fig. 2.50 Generation of a new marking

Figure 2.50 demonstrates how transition $t_j$ affects the current marking. By
default, unlabeled edges are considered to have a weight of 1, and unlabeled places
are considered to have unlimited capacity $\omega$.

We now need to explain the two conditions that must be met before a transition 
t € T can take place: 


• for all places $p$ in the pre-set, the number of tokens must at least be equal to the
weight of the edge from $p$ to $t$ and

• for all places $p$ in the post-set, the capacity must be large enough to accommodate
the new tokens which $t$ will generate.


Transitions meeting these two conditions are called M-activated. Formally, this 
can be defined as follows: 


Definition 2.22 Transition $t \in T$ is said to be M-activated $\iff$
$(\forall p \in {}^\bullet t : M(p) \ge W(p,t)) \land (\forall p' \in t^\bullet : M(p') + W(t,p') \le K(p'))$


Activated transitions can happen, but they do not need to. If several transitions are 
activated, the sequence in which they happen is not deterministically defined. 
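The activation condition of Definition 2.22 can be checked as follows. This is a sketch under assumptions: the dictionaries `W_in` and `W_out`, the constant `OMEGA`, and the example net are all hypothetical names introduced here, and `math.inf` merely stands in for the symbol for unlimited capacity.

```python
import math

OMEGA = math.inf  # stands in for unlimited capacity

def is_activated(t, M, W_in, W_out, K):
    """Check the two conditions of Definition 2.22 for transition t under marking M.

    W_in[t]  maps each place p in the pre-set of t to the weight W(p, t).
    W_out[t] maps each place p in the post-set of t to the weight W(t, p).
    K maps places to capacities; places not listed have capacity OMEGA.
    """
    enough_tokens = all(M[p] >= w for p, w in W_in[t].items())
    enough_room = all(M[p] + w <= K.get(p, OMEGA) for p, w in W_out[t].items())
    return enough_tokens and enough_room

# t1 needs two tokens from p1 and puts one token on p2, which has capacity 1.
W_in = {"t1": {"p1": 2}}
W_out = {"t1": {"p2": 1}}
K = {"p2": 1}
print(is_activated("t1", {"p1": 2, "p2": 0}, W_in, W_out, K))
```

The function only reports whether a transition can happen; as stated above, an activated transition does not have to fire.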

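The impact of a firing transition $t$ on the number of tokens can be represented
conveniently by a vector $\underline{t}$ associated with $t$. $\underline{t}$ is defined as follows:

\[
\underline{t}(p) = \begin{cases}
-W(p,t), & \text{if } p \in {}^\bullet t \setminus t^\bullet \\
+W(t,p), & \text{if } p \in t^\bullet \setminus {}^\bullet t \\
-W(p,t) + W(t,p), & \text{if } p \in {}^\bullet t \cap t^\bullet \\
0 & \text{otherwise}
\end{cases}
\]

The new number $M'$ of tokens, resulting from the firing of transition $t$, can be
computed for all places $p$ as follows:

\[ M'(p) = M(p) + \underline{t}(p) \]

Using “+” to denote vector addition, we can rewrite this equation as follows:

\[ M' = M + \underline{t} \]

The set of all vectors $\underline{t}$ forms an incidence matrix $\underline{N}$. $\underline{N}$ contains vectors $\underline{t}$ as
columns.

\[ \underline{N} : P \times T \to \mathbb{Z}, \quad \forall t \in T : \underline{N}(p,t) = \underline{t}(p) \]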
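The firing vector and the incidence matrix can be computed from the edge weights. In the sketch below, the four cases of the definition collapse into a single expression, because a missing edge is treated as weight 0. The dictionaries `W_in` and `W_out` and the example net are assumptions made for illustration only.

```python
def firing_vector(t, places, W_in, W_out):
    """Entry for place p is W(t, p) - W(p, t); missing edges count as 0."""
    return [W_out[t].get(p, 0) - W_in[t].get(p, 0) for p in places]

def fire(M, vector):
    """New marking M' = M + t (component-wise vector addition)."""
    return [m + d for m, d in zip(M, vector)]

def incidence_matrix(transitions, places, W_in, W_out):
    """Matrix with one row per place and one column per transition."""
    columns = [firing_vector(t, places, W_in, W_out) for t in transitions]
    return [list(row) for row in zip(*columns)]

places = ["p1", "p2", "p3"]
W_in = {"t1": {"p1": 1, "p2": 2}, "t2": {"p3": 1}}
W_out = {"t1": {"p3": 1}, "t2": {"p1": 1, "p3": 2}}

t1 = firing_vector("t1", places, W_in, W_out)
N = incidence_matrix(["t1", "t2"], places, W_in, W_out)
print(t1, fire([1, 2, 0], t1), N)
```

Place `p3` is both in the pre-set and in the post-set of `t2`, so its entry is the difference of the two weights. The sketch does not check activation before firing.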


It is possible to formally prove system properties by using matrix $\underline{N}$. For
example, we are able to compute sets of places, for which firing transitions will not
change the overall number of tokens [468]. Such sets are called place invariants.
Let us initially consider a single transition $t_j$ in order to find such invariants. Let
us search for sets $R \subseteq P$ of places such that the total number of tokens does not
change if $t_j$ fires. The following must hold for such sets:

\[ \sum_{p \in R} \underline{t}_j(p) = 0 \tag{2.14} \]

Figure 2.51 shows a transition for which the total number of tokens does not
change if it fires.

We are now introducing the characteristic vector $\underline{c}_R$ of some set $R$ of places:

\[
\underline{c}_R(p) = \begin{cases}
1 & \text{iff } p \in R \\
0 & \text{iff } p \notin R
\end{cases}
\]

With this definition, we can rewrite Eq. (2.14) as

\[ \sum_{p \in R} \underline{t}_j(p) = \sum_{p \in P} \underline{t}_j(p) \cdot \underline{c}_R(p) = \underline{t}_j \cdot \underline{c}_R = 0 \tag{2.15} \]

$\cdot$ denotes the scalar product. Now, we search for sets of places such that firings of
any transition will not change the total number of tokens. This means that Eq. (2.15)
must hold for all transitions $t_j$:

\[
\begin{aligned}
\underline{t}_1 \cdot \underline{c}_R &= 0 \\
&\;\;\vdots \\
\underline{t}_n \cdot \underline{c}_R &= 0
\end{aligned}
\tag{2.16}
\]

Equations (2.16) can be combined into the following equation by using the
transposed incidence matrix $\underline{N}^T$:

\[ \underline{N}^T \underline{c}_R = 0 \tag{2.17} \]

Equation (2.17) represents a system of linear, homogeneous equations. Matrix $\underline{N}$
represents edge weights of our Petri nets. We are looking for solution vectors $\underline{c}_R$ for
this system of equations. Solutions must be characteristic vectors. Therefore, their
components must be 1 or 0 (integer weights can be accepted if we use weighted
sums of tokens). This is more complex than solving systems of linear equations
with real-valued solution vectors. Nevertheless, it is possible to obtain information
by solving Eq. (2.17). Using this proof technique, we can, for example, show that
we are correctly implementing mutually exclusive access to shared resources.
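For very small nets, characteristic vectors satisfying Eq. (2.17) can be found by simply trying all 0/1 vectors. The sketch below does this for a hypothetical model of the single-track segment; the place names and the rows of the transposed incidence matrix are assumptions made for illustration, and exhaustive search is only a stand-in for the linear-algebra techniques used in practice.

```python
from itertools import product

PLACES = ["wants_right", "going_right", "track_available", "going_left", "wants_left"]

# Rows of the transposed incidence matrix: one firing vector per transition,
# entries ordered like PLACES (an assumed model of the railroad example).
N_T = {
    "enter_from_left":  [-1, +1, -1, 0, 0],
    "leave_to_right":   [0, -1, +1, 0, 0],
    "enter_from_right": [0, 0, -1, +1, -1],
    "leave_to_left":    [0, 0, +1, -1, 0],
}

def is_place_invariant(c):
    """True if every firing vector has scalar product 0 with c."""
    return all(sum(a * b for a, b in zip(row, c)) == 0 for row in N_T.values())

# Exhaustive search over all non-zero characteristic vectors (2^5 - 1 candidates).
invariants = [c for c in product((0, 1), repeat=len(PLACES))
              if any(c) and is_place_invariant(c)]
print(invariants)
```

The only invariant found covers the places "going_right," "track_available," and "going_left." Since these places hold exactly one token initially, at most one train can be on the track at any time, which is the mutual exclusion property.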


Example 2.24 Let us now consider a larger example: We are again considering the 
synchronization of trains. In particular, we are trying to model high-speed Thalys 
trains traveling between Amsterdam, Cologne, Brussels, and Paris. Segments of 
the train run independently from Amsterdam and Cologne to Brussels. There, the 
segments get connected and then they run to Paris. On the way back from Paris, they 
get disconnected at Brussels again. We assume that Thalys trains must synchronize 
with some other train at Paris. The corresponding Petri net is shown in Fig. 2.52. 
Places 3 and 10 model trains waiting at Cologne and Amsterdam, respectively.
Transitions 2 and 9 model trains driving from these cities to Brussels. After their
arrival at Brussels, places 2 and 9 contain tokens. Transition 1 denotes connecting
the two trains. Place 13 models the driver of one of the trains, who will have a break
at Brussels while the other driver is continuing on to Paris. Transition 5 models
synchronization with other trains at the Gare du Nord station of Paris. These other
trains connect Gare du Nord with some other station (we have used Gare de Lyon
as an example, even though the situation at Paris is somewhat more complex). Of
course, Thalys trains do not use steam engines; they are just easier to visualize than
modern high-speed trains. Table 2.3 shows matrix $\underline{N}^T$ for this example.

Table 2.3 $\underline{N}^T$ for the Thalys example

      p1   p2   p3   p4   p5   p6   p7   p8   p9   p10  p11  p12  p13
t1     1   -1    0    0    0    0    0    0   -1    0    0    0    1
t2     0    1   -1    0    0    0    0    0    0    0    0    0    0

For example, row 2 indicates that firing $t_2$ will increase the number of tokens
on $p_2$ by 1 and decrease the number of tokens on $p_3$ by 1. Using techniques from
linear algebra, we are able to show that the following four vectors are solutions for
this system of linear equations:

\[
\begin{aligned}
\underline{c}_{R,1} &= (1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0) \\
\underline{c}_{R,2} &= (1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0) \\
\underline{c}_{R,3} &= (0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1) \\
\underline{c}_{R,4} &= (0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0)
\end{aligned}
\]


These vectors correspond to the places along the track for trains from Cologne, 
to the places along the track for trains from Amsterdam, to the places along the 
path for drivers of trains from Amsterdam, and to the places along the track within 
Paris, respectively. Therefore, we are able to show that the number of trains and 
drivers along these tracks is constant (something which we actually expect). This 
example demonstrates that place invariants provide us with a standardized technique 
for proving properties about systems. ∇
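The claim that these vectors solve the system can be spot-checked for the two rows of Table 2.3 that are available here. Note the assumptions: only rows $t_1$ and $t_2$ are checked, and the column positions of the entries of row $t_1$ (places 1, 2, 9, and 13) are inferred from the description of transition 1 in the text, not read off a complete table.

```python
# Rows t1 and t2 of the transposed incidence matrix, columns p1 .. p13.
# The column positions in row t1 are an assumption inferred from the text.
ROWS = {
    "t1": [1, -1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 1],
    "t2": [0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

# The four candidate characteristic vectors given in the example.
C_R = [
    (1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
    (1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0),
    (0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1),
    (0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0),
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# One scalar product per (transition, vector) pair; all should be 0.
products = {(t, i + 1): dot(row, c) for t, row in ROWS.items() for i, c in enumerate(C_R)}
print(products)
```

A complete check would repeat this for all transitions of the net.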


2.6.4 Predicate/Transition Nets


Condition/event nets as well as place/transition nets can quickly become very large 
for large examples. A reduction of the size of the nets is frequently possible with 
predicate/transition nets. 


Example 2.25 We will demonstrate this, using the so-called dining philosophers’ 
problem as an example. The problem is based on the assumption that a set of 
philosophers is dining at a round table. In front of each philosopher, there is a plate 
containing spaghetti (see Fig. 2.53). Between each of the plates, there is just one 
fork. Each philosopher is either eating or thinking. Eating philosophers need their 
two adjacent forks for that, so they can only eat if their neighbors are not eating. 

This situation can be modeled as a condition/event net, as shown in Fig. 2.54.
Conditions $t_j$ correspond to the thinking states, conditions $e_j$ correspond to the
eating states, and conditions $f_j$ represent available forks. Considering the small size
of the problem, this net is already very large. The size of this net can be reduced
by using predicate/transition nets. Figure 2.55 is a model of the same problem as
a predicate/transition net. With predicate/transition nets, tokens have an identity
and can be distinguished from each other. Predicate/transition nets have also been
called colored Petri nets (CPN). See Jensen [272] for a survey of applications of
CPNs for modeling of ICT systems, including communication protocols. We use
token identities in Fig. 2.55 in order to distinguish between the three different philosophers $p_1$
to $p_3$ and to identify fork $f_3$. Furthermore, edges can be labeled with variables and
functions. In the example, we use variables to represent the identity of philosophers
and functions $l(x)$ and $r(x)$ to denote the left and right forks of philosopher $x$,
respectively. These two forks are required as a precondition for transition $u$ and
returned as a postcondition by transition $v$. This model can be easily extended to
$n > 3$ philosophers. We just need to add more tokens. In contrast to the net in
Fig. 2.54, the structure of the net does not have to be changed. ∇
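The idea of tokens with identities can be sketched as follows. The places hold sets of identified tokens, and the transitions `u` and `v` take the philosopher's identity as a parameter. The numbering of forks, i.e., the concrete definitions of `l` and `r`, is an assumption made for this sketch and is not taken from Fig. 2.55.

```python
def make_net(n):
    """Initial marking for n philosophers: all thinking, all forks available."""
    ids = set(range(1, n + 1))
    return {"n": n, "thinking": set(ids), "eating": set(), "forks": set(ids)}

def l(x, n):
    """Left fork of philosopher x (assumed numbering)."""
    return x

def r(x, n):
    """Right fork of philosopher x (assumed numbering, wrapping around the table)."""
    return x % n + 1

def can_start_eating(net, x):
    return x in net["thinking"] and {l(x, net["n"]), r(x, net["n"])} <= net["forks"]

def u(net, x):
    """Transition u: philosopher x takes both forks and starts eating."""
    if not can_start_eating(net, x):
        raise ValueError(f"u is not activated for philosopher {x}")
    net["thinking"].remove(x)
    net["eating"].add(x)
    net["forks"] -= {l(x, net["n"]), r(x, net["n"])}

def v(net, x):
    """Transition v: philosopher x returns both forks and resumes thinking."""
    if x not in net["eating"]:
        raise ValueError(f"v is not activated for philosopher {x}")
    net["eating"].remove(x)
    net["thinking"].add(x)
    net["forks"] |= {l(x, net["n"]), r(x, net["n"])}

net = make_net(3)
u(net, 1)
print(can_start_eating(net, 2), can_start_eating(net, 3))
```

Moving to more philosophers only changes the argument of `make_net`; the transition functions stay the same, which mirrors the remark that only tokens have to be added.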


2.6.5 Evaluation 


The key advantage of Petri nets is their power for modeling causal dependencies. 
Standard Petri nets have no notion of time, and all decisions can be taken locally 
by just analyzing transitions and their pre- and postconditions. Therefore, they can 
be used for modeling geographically distributed systems. Furthermore, there is a 
strong theoretical foundation for Petri nets, simplifying formal proofs of system 
properties. Petri nets are not necessarily determinate: different firing sequences can 
lead to different results. The descriptive power of Petri nets encompasses that of 
other MoCs, including finite state machines. 

In certain contexts, their strength is also their weakness. If time is to be modeled, 
standard Petri nets cannot be used. Furthermore, standard Petri nets have no notion 
of hierarchy and no programming language elements, let alone object-oriented 
features. In general, it is difficult to represent data. 

There are extended versions of Petri nets avoiding the mentioned weaknesses. 
However, there is no universal extended version of Petri nets meeting all requirements
mentioned at the beginning of this chapter. Nevertheless, due to the increasing
amount of distributed computing, Petri nets have become more popular.

UML includes extended Petri nets called activity diagrams. Extensions include 
symbols denoting decisions (like in ordinary flow charts). The placement of symbols 
is similar to SDL. 


Example 2.26 Figure 2.56 shows an activity diagram of the procedure to be followed
during a standardization process. Forks and joins of control correspond to transitions
in Petri nets, and they use the symbols (horizontal bars) that were initially used
for Petri nets as well. The diamond at the bottom shows the symbol used for
decisions. Activities can be organized into “swim lanes” (areas between vertical
dotted lines) such that the different responsibilities and the documents exchanged
can be visualized. ∇

Fig. 2.56 Activity diagram [299]


Interestingly, Petri nets were initially not a mainstream technique. Decades after 
their invention, they have become a popular technique due to their inclusion 
in UML. 
Fig. 2.57 Two cross-connected NOR-gates forming an RS latch

Table 2.4 Sequence of values at inputs and outputs of RS latch (columns: t < 0, t = 0, t > 0)


2.7 Discrete Event-Based Languages 


2.7.1 Basic Discrete Event Simulation Cycle 


The discrete event-based model of computation is based on simulating the genera- 
tion of events and processing them over time. We use a queue of future events, and 
these are sorted by the time at which they should be processed. Semantics is defined 
by fetching the event at the head of the queue, performing the corresponding actions, 
and possibly entering new events into the queue. Time is advanced whenever no 
action exists which should be performed at the current time. This is the basic 
algorithm: 


loop 
fetch next entry from queue; 
perform function (e.g., assignment of variables as listed in the entry) 
(this may include the generation of new events); 

until termination criterion is met; 
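The basic cycle above can be sketched with a priority queue. The following is a minimal illustration, not a reference implementation: the function name `simulate`, the handler convention (a handler returns a list of new events as `(delay, name, value)` triples), and the termination criterion (empty queue or an event limit) are all assumptions made here.

```python
import heapq
import itertools

def simulate(initial_events, handlers, max_events=1000):
    """Run the discrete event cycle; events are (time, name, value) triples."""
    counter = itertools.count()   # tie-breaker: equal times keep insertion order
    queue = [(t, next(counter), name, value) for t, name, value in initial_events]
    heapq.heapify(queue)
    log = []
    while queue and len(log) < max_events:           # termination criterion
        time, _, name, value = heapq.heappop(queue)  # fetch next entry from queue
        log.append((time, name, value))
        handler = handlers.get(name, lambda t, v: [])
        for delay, new_name, new_value in handler(time, value):  # perform function
            heapq.heappush(queue, (time + delay, next(counter), new_name, new_value))
    return log

# Two hypothetical event types that trigger each other a few times.
handlers = {
    "ping": lambda t, v: [(2, "pong", v + 1)] if v < 2 else [],
    "pong": lambda t, v: [(1, "ping", v)],
}
log = simulate([(0, "ping", 0)], handlers)
print(log)
```

Simulated time advances implicitly: the time of the event at the head of the queue becomes the current time.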


Hardware description languages (HDLs) are typically based on the discrete event 
model. We will use HDLs as a prominent example of discrete event modeling. 


Example 2.27 We demonstrate the application of this general scheme to simulate 
an RS-latch (see Fig. 2.57). The latch consists of two cross-coupled NOR-gates. 
The corresponding code in a hardware description language, in this case VHDL, is 
included in Fig. 2.57 as well. A representative sequence of values at the inputs and 
outputs is shown in Table 2.4. 

Let us assume that initially, the latch is set, and this state is maintained, i.e.,
output Q is '1' and R = S = '0'. The operation of both NOR-gates is described by
processes gate1 and gate2. These processes are initially inactive, waiting for some
event on their inputs a or b. This waiting is expressed by the lists (a,b). gate1 and
gate2 are said to be sensitive to the entries in that list.

Now, suppose that at time 0, input R, the reset input, is changed to '1'. We expect 
the latch to be reset. In terms of events, this works as follows: The change at input 
R is an event, which is stored in the queue of future events. 

This event is immediately processed, since it is the only event in the queue. This
event will wake up gate2, since this gate is sensitive to changes on its input b.
gate2 will compute the nor function, with a result of '0', and will then perform
the assignment c <= '0'. This notation indicates a signal assignment. This means
that the new values will initially be stored only in the entries of future events. The
actual assignment to the variable on the left becomes effective only when the time
for processing this entry in the list of future events has been reached. In our example,
an event requesting output c of gate2 to be set to '0' will be created and stored in
the event queue.

This event will be immediately fetched, since it is the only event. The event will
set output c to '0'. This wakes up gate1, due to its sensitivity. gate1 will compute
the nor function as well. This computation results in an event, requesting output c
of gate1 to be set to '1'. This event will also be stored in the queue.

This event will also be immediately processed, setting the output as requested.
This change will wake up gate2 again. gate2 will again compute an output of '0'.
Further details will depend somewhat on the mechanism which is used to detect 
stable situations not requiring further events to be generated. 

We could have added delays in terms of real physical units to each of the signal 
assignments, which would have allowed us to keep track of elapsed time. Overall, 
this event-based simulation approximates the behavior of a real latch. ∇
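The sequence of events described in Example 2.27 can be replayed with a small Python sketch. Several assumptions are made here: the wiring of the gates (gate1 computes nQ from S and Q, gate2 computes Q from nQ and R) is inferred from the text, the signal name `nQ` is invented, and stability is detected by dropping events that do not change a signal value. This is an analogy to the VHDL model, not a translation of it.

```python
from collections import deque

def nor(a, b):
    return '0' if '1' in (a, b) else '1'

# Each gate: (input a, input b, output c). The wiring is an assumption.
GATES = {"gate1": ("S", "Q", "nQ"), "gate2": ("nQ", "R", "Q")}

def simulate_latch(signals, stimulus, max_events=100):
    """Process events (signal, value) until no further changes occur."""
    queue = deque([stimulus])
    trace = []
    while queue and len(trace) < max_events:
        name, value = queue.popleft()
        if signals[name] == value:
            continue                  # no change: nothing to wake up (stable)
        signals[name] = value         # the assignment becomes effective now
        trace.append((name, value))
        for a, b, c in GATES.values():
            if name in (a, b):        # gate is sensitive to this signal
                queue.append((c, nor(signals[a], signals[b])))
    return trace

signals = {"R": '0', "S": '0', "Q": '1', "nQ": '0'}   # latch initially set
trace = simulate_latch(signals, ("R", '1'))
print(trace, signals)
```

The trace shows the reset input changing first, then Q going to '0', then the complementary output going to '1'; the final recomputation by gate2 produces no change and ends the simulation.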


2.7.2 Multi-Valued Logic


Which values could we use for the signals in the above example? In this book, 
we are restricting ourselves to embedded systems implemented with binary logic. 
Nevertheless, it may be advisable or necessary to use more than two values for 
modeling such systems. For example, our systems might contain electrical signals 
of different strengths. It may be necessary to compute the strength and the logic 
level resulting from a connection of two or more sources of electrical signals. In 
the following, we will therefore distinguish between the level and the strength of 
a signal. While the former is an abstraction of the signal voltage, the latter is an 
abstraction of the impedance (resistance) of the voltage source. We will be using 
discrete sets of signal values representing the signal level and the strength. Using 
discrete sets of strengths avoids the problems of having to solve Kirchhoff’s 
network equations and enables us to avoid analog models used in electrical 
engineering. We will also model unknown electrical signals by special signal 
values. 

In practice, electronic design systems use a variety of value sets. Some systems
allow only 2, while others allow 9 or 46. The overall goal of developing discrete 
value sets is to avoid the problems of solving network equations and still model 
existing systems with sufficient precision. 

In the following, we will present a systematic technique for building up value 
sets and relating these to each other. We will use the strength of electrical signals as 
the key parameter for distinguishing between various value sets. A systematic way 
of building up value sets, called CSA theory, was presented by Hayes [208]. CSA 
stands for “connector, switch, attenuator.” These three elements are key elements of 
this theory. We will later show how the standard value set used for most cases of 
VHDL-based modeling can be derived as a special case. 


One Signal Strength (Two Logic Values) 


In the simplest case, we will start with just two logic values, called '0' and '1'.
These two values are considered to be of the same strength. This means if two wires
connect values '0' and '1', we will not know anything about the resulting signal
level.

A single signal strength may be sufficient if no two wires carrying values '0'
and '1' are connected and no signals of different strength meet at a particular node
of electronic circuits.


Two Signal Strengths (Three and Four Logic Values)


In many circuits, there may be instances in which a certain electrical signal is not
actively driven by any output. This may be the case, when a certain wire is not
connected to ground, the supply voltage, or any circuit node. For example, systems
may contain open-collector outputs (see Fig. 2.58 (left)).¹⁶

Fig. 2.58 Effectively disconnectable outputs: left, open collector output; right, tristate output (in both circuits, output A lies between VDD and ground; Input = '0' on the left and enable = '0' on the right leave A disconnected)

¹⁶ Schematics should help students to understand signal values, not make it more difficult. Students
unfamiliar with schematics could just study logic values.

If the “pull-down” transistor PD is non-conducting, the output is effectively
disconnected. For the tristate outputs (see Fig. 2.58 (right)), an enable signal of '0'
will generate a '0' at the outputs of the AND-gates (denoted by &) and will make
both transistors non-conducting. As a result, output A will be disconnected.¹⁷ Hence,
using appropriate input signals, such outputs can be effectively disconnected from a
wire.

The signal strength of disconnected outputs is the smallest strength that we
can think of. We will denote the value at disconnected outputs as 'Z'. The signal
strength of 'Z' is smaller than that of '0' and '1'. Furthermore, the signal level of
'Z' is unknown. If a signal of value 'Z' is connected to another signal, that other
signal will always dominate. For example, if two tristate outputs are connected to
the same bus and if one output contributes a value of 'Z', the resulting value on the
bus will always be the value contributed by the second output (see Fig. 2.59).

In most cases, three-valued logic sets {'0','1','Z'} are extended by a fourth
value called 'X'. 'X' represents an unknown signal level of the same strength as '0'
or '1'. More precisely, we are using 'X' to represent unknown values of signals that
can be either '0' or '1' or some voltage representing neither '0' nor '1'.¹⁸

If multiple signals get connected, we have to compute the resulting value. This
can be done easily if we make use of a partial order among the four signal values
'0', '1', 'Z', and 'X'. The partial order is depicted in the Hasse diagram in Fig.
2.60.

Edges in this figure reflect the domination of signal values. Edges define a
relation $>$. If $a > b$, then $a$ dominates $b$. '0' and '1' dominate 'Z'. 'X' dominates
all other signal values. Based on the relation $>$, we define a relation $\ge$. $a \ge b$ holds
iff $a > b$ or $a = b$.


¹⁷ Pull-up transistors may be depletion transistors, and the tristate outputs may be inverting.
¹⁸ There are other interpretations of 'X' [65], but ours is the most useful in our context.

We define an operation sup on two signals, which returns the supremum of the
two signal values. 


Definition 2.23 Let $a$ and $b$ be two signal values from a partially ordered set $(S, \ge)$.
The supremum $c \in S$ of the two values $a$ and $b$ is the smallest value for which $c \ge a$
and $c \ge b$ holds.


For example, sup('Z', '0') = '0', sup('Z', '1') = '1', sup('0', '1') = 'X', etc.


Lemma 2.1 Let a and b be two signals having values from a partially ordered set, 
where the partial order has been selected as shown above. Then, the sup function 
computes resulting signal values if the two signals get connected. 


The supremum corresponds to the connect element of the CSA theory. 
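The supremum can be computed directly from the domination relation. The sketch below encodes the four-valued order as described in the text ('0' and '1' dominate 'Z', and 'X' dominates '0' and '1'); the dictionary `COVERS` and the helper names are assumptions introduced for illustration.

```python
from functools import reduce

# Direct domination edges of the Hasse diagram, as described in the text.
COVERS = {'X': {'0', '1'}, '0': {'Z'}, '1': {'Z'}, 'Z': set()}

def dominated(a):
    """All values b with a >= b (a itself and everything below it)."""
    result = {a}
    for b in COVERS[a]:
        result |= dominated(b)
    return result

def sup(a, b):
    """Smallest value c with c >= a and c >= b."""
    upper = [c for c in COVERS if {a, b} <= dominated(c)]
    return next(c for c in upper if all(c in dominated(d) for d in upper))

def resolve(drivers):
    """Value on a wire connected to all given drivers ('Z' if there are none)."""
    return reduce(sup, drivers, 'Z')

print(sup('Z', '0'), sup('0', '1'), resolve(['Z', '1', 'Z']))
```

`resolve` models the bus example: outputs contributing 'Z' have no influence on the resulting value.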


Three Signal Strengths (Seven Signal Values) 


In many circuits, two signal strengths are not sufficient. A common case that 
requires more values is the use of depletion transistors (see Fig. 2.61). 

The effect of the depletion transistor is similar to that of a resistor providing a 
low conductance path to the supply voltage VDD. The depletion transistor and the 
“pull-down transistor” PD act as drivers for node A of the circuit, and the signal value 
at node A can be computed using the supremum function. The pull-down transistor 
provides a driver value of '0' or 'Z', depending upon the input to PD. The depletion
transistor provides a signal value, which is weaker than '0' and '1'. Its signal
level corresponds to the signal level of '1'. We represent the value contributed by
the depletion transistor by 'H', and we call it a “weak logic one.” Similarly, there
can be weak logic zeros, represented by 'L'. The value resulting from the possible
connection between 'H' and 'L' is called a “weak logic undefined,” denoted as 'W'.
As a result, we have three signal strengths and seven logic values {'0', '1', 'L',
'H', 'W', 'X', 'Z'}. Computing the resulting signal value can again be based on a
partial order among these seven values. The corresponding partial order is shown in
Fig. 2.62.

sup is also defined for this partially ordered set. For example, sup('H', '0') =
'0', sup('H', 'Z') = 'H', sup('H', 'L') = 'W'.

'0' and 'L' represent the same signal levels but a different strength. The same
holds for the pair '1' and 'H'. Devices increasing signal strengths are called
amplifiers; devices reducing signal strengths are called attenuators.
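For the seven-valued set, one convenient encoding stores a strength and a level per value. The stronger signal wins; two different values of equal strength yield the unknown value of that strength. This encoding is an assumption made here; it reproduces the examples given in the text but has not been checked against every edge of Fig. 2.62.

```python
# (strength, level) per value; level None means unknown or undefined.
VALUES = {
    'Z': (0, None),
    'L': (1, '0'), 'H': (1, '1'), 'W': (1, None),
    '0': (2, '0'), '1': (2, '1'), 'X': (2, None),
}
UNKNOWN = {0: 'Z', 1: 'W', 2: 'X'}   # unknown value for each strength

def sup(a, b):
    strength_a, strength_b = VALUES[a][0], VALUES[b][0]
    if strength_a != strength_b:
        return a if strength_a > strength_b else b   # stronger signal dominates
    return a if a == b else UNKNOWN[strength_a]      # conflict at equal strength

def level(value):
    return VALUES[value][1]

print(sup('H', '0'), sup('H', 'Z'), sup('H', 'L'))
```

In this encoding, an amplifier or attenuator would change only the strength component of a value and leave its level untouched.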


Four Signal Strengths (Ten Signal Values) 


In some cases, three signal strengths are not sufficient. For example, there are cir- 
cuits using charges stored on wires. Such wires are charged to levels corresponding 
to '0' or '1' during some phases of the operation of the electronic circuit. This
stored charge can control the (high impedance) inputs of some transistors. However, 
if these wires get connected to even the weakest signal source (except 'Z'), they 
lose their charge, and the signal value from that source dominates. 


Example 2.28 In Fig. 2.63, we are driving a bus from a specialized output. The
bus has a high capacitive load C. While function f is still '0', we set pre to
'1', charging capacitor C. Then we set pre to '0'. If the real value of function
f becomes known and it turns out to be '1', we discharge the bus. ∇


The key reason for using precharging is that charging a bus using an output such 
as the one shown in Fig. 2.61 is a slow process, since the resistance of depletion 
transistors is large. Discharging through regular pull-down transistors PD is a much 
faster process. 

In order to model such cases, we need signal values which are weaker than 
'H' and 'L' but stronger than 'Z'. We call such values “very weak signal values” 
and denote them by 'h' and 'l'. The corresponding very weak unknown value is
denoted by 'w'. As a result, we obtain ten signal values {'0', '1', 'L', 'H', 'l',
'h', 'X', 'W', 'w', 'Z'}. Using signal strengths, we can again define a partial order
among these values (see Fig. 2.64).

Note that precharging is not without risks. Once a precharged wire is discharged 
due to a transient signal, it cannot be recharged during the same clock period. 
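The behavior of a precharged bus can be sketched with the ten-valued set. The strength ranking below and the rule for conflicts of equal strength are assumptions consistent with the text; the function `bus_value`, which treats the stored charge as one more (very weak) contribution, is a hypothetical helper introduced here.

```python
from functools import reduce

# Four strength classes: disconnected, very weak (stored charge), weak, strong.
STRENGTH = {'Z': 0, 'l': 1, 'h': 1, 'w': 1, 'L': 2, 'H': 2, 'W': 2,
            '0': 3, '1': 3, 'X': 3}
UNKNOWN = {0: 'Z', 1: 'w', 2: 'W', 3: 'X'}

def sup(a, b):
    strength_a, strength_b = STRENGTH[a], STRENGTH[b]
    if strength_a != strength_b:
        return a if strength_a > strength_b else b
    return a if a == b else UNKNOWN[strength_a]

def bus_value(stored, drivers):
    """Resolve the charge stored on the bus against all connected drivers."""
    return reduce(sup, drivers, stored)

precharged = 'h'                       # bus after the precharge phase
print(bus_value(precharged, ['Z']),    # nobody drives: the charge is kept
      bus_value(precharged, ['0']))    # pull-down conducts: the bus is discharged
```

Any driver other than 'Z' overrides the stored charge, which reflects the risk mentioned above: a transient '0' is enough to lose the precharged value.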


Five Signal Strengths 


So far, we have ignored power supply signals. These are stronger than the strongest 
signals we have considered so far. Signal value sets taking power supply signals 
into account have resulted in the definition of initially popular 46-valued value sets 
[106]. However, such models are hardly used anymore. 


2.7.3 Transaction-Level Modeling (TLM) 


Discrete event simulation allows us to keep track of simulated time. However, it is 
not obvious how precisely we will be modeling time. A very precise model reflecting 
detailed timing of hardware signals will require long simulation times. In particular, 
very long simulation times are needed when we model electrical circuits. Faster 
simulation is feasible with cycle-accurate models reflecting the number of clock 
cycles in a clocked (synchronous) system implementation. More simulation speed 
can be gained from more coarse-grained timing models. In particular, transaction- 
level modeling (TLM) has received much attention. TLM has been defined as 
follows: 


Fig. 2.65 Distinction between different timing models


Definition 2.24 ([191]) “Transaction-level modeling (TLM) is a high-level
approach to modeling digital systems where details of communication among
modules are separated from the details of the implementation of functional units
or of the communication architecture. Communication mechanisms such as buses 
or FIFOs are modeled as channels, and are presented to modules using SystemC 
interface classes. Transaction requests take place by calling interface functions 
of these channel models, which encapsulate low-level details of the information 
exchange. At the transaction level, the emphasis is more on the functionality of 
the data transfers — what data are transferred to and from what locations — and 
less on their actual implementation, that is, on the actual protocol used for data 
transfer. This approach makes it easier for the system-level designer to experiment, 
for example, with different bus architectures (all supporting a common abstract 
interface) without having to recode models that interact with any of the buses, 
provided these models interact with the bus through the common interface.” 
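The central idea of this definition, modules that talk to a channel only through a common abstract interface, can be illustrated with a small Python analogy. This is not SystemC and not the TLM standard: the class names, the method names `read` and `write`, and the cost model are all hypothetical and chosen only for this sketch.

```python
from abc import ABC, abstractmethod

class BusInterface(ABC):
    """Common abstract interface: modules only see read and write transactions."""
    @abstractmethod
    def write(self, address, data): ...
    @abstractmethod
    def read(self, address): ...

class UntimedBus(BusInterface):
    """Channel model without any notion of time."""
    def __init__(self):
        self.memory = {}
    def write(self, address, data):
        self.memory[address] = data
    def read(self, address):
        return self.memory.get(address, 0)

class ApproximatelyTimedBus(UntimedBus):
    """Same functionality, plus a rough cost per transaction (assumed model)."""
    def __init__(self, cost_per_transfer):
        super().__init__()
        self.cost = cost_per_transfer
        self.elapsed = 0
    def write(self, address, data):
        self.elapsed += self.cost
        super().write(address, data)
    def read(self, address):
        self.elapsed += self.cost
        return super().read(address)

def module(bus):
    """A module written once against the interface, independent of the bus model."""
    for i in range(4):
        bus.write(i, i * i)
    return [bus.read(i) for i in range(4)]

timed = ApproximatelyTimedBus(cost_per_transfer=3)
print(module(UntimedBus()), module(timed), timed.elapsed)
```

The function `module` does not have to be recoded when the bus model is exchanged, which is the benefit described in the last sentence of the definition.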


A more detailed distinction between different timing models was described by
Cai and Gajski [83]. They distinguish between timing models for communication
and for computation,¹⁹ and they consider different cases of timing models, depending
upon how precisely communication and computation are modeled. Six cases are
shown in Fig. 2.65.

For communication as well as for computations, we distinguish between 
untimed, approximately timed, and cycle-timed models. In the diagram in Fig. 2.65, 
crosses mark three unbalanced combinations of timing models, which have not been 
considered by Cai and Gajski. As a result, we consider six remaining cases [83]: 


A Untimed models: In this case, we model only the functionality and do not
consider timing at all. Such models are appropriate for early design phases.
A model of this kind can be called a specification model.

B In the specification model, we can replace pure functionality descriptions
by descriptions of components using rough timing models. For example, we
might know the WCET of some code running on a processor. We would still
model communication by abstract communication primitives. As a result, we
obtain node B in Fig. 2.65. Such a model can be called a component assembly
model.

¹⁹ This is very much in line with the same distinction which we have made in Table 2.1 on p. 41.

C In a model of type B, we could replace abstract communication primitives by
communication models which are approximately timed. This means that we
try to model access conflicts and their impact on the timing, but we do not
model the impact of each and every signal, nor do we model any links to clock
cycles. Such a model can be called a bus arbitration model.

D In a model of type C, we could replace rough communication timing models
with cycle-timed models. This implies that we keep track of elapsed clock
cycles in our simulation. We might even consider real, physical time. The
resulting model, denoted as node D in Fig. 2.65, can be called a bus functional
model [83].

E In a model of type C, we could also replace rough computation timing models
by cycle-accurate timing models of the computation. This allows us, for
example, to capture memory references in detail. The resulting model can be
called a cycle-accurate computation model.

F The node labeled F is obtained when communication and computation are 
modeled in a cycle-accurate way. Such a model can be called an implementa- 
tion model. 


Design procedures need to traverse the diagram in Fig. 2.65 from A to F, from the 
bottom left to the top right. 
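
The six cases can also be written down as a small lookup table. The following Python sketch is only an illustration: the names LEVELS, MODELS, excluded_combinations, and is_refinement_path are made up here, and the assignment of nodes to (computation, communication) pairs is read off the descriptions of A to F above. Which three combinations are the excluded ones is derived from that table and is not stated explicitly in the text.

```python
# Timing precision levels, ordered from least to most detailed.
LEVELS = ["untimed", "approximately timed", "cycle-timed"]

# (computation timing, communication timing) -> (node, model name)
MODELS = {
    ("untimed", "untimed"): ("A", "specification model"),
    ("approximately timed", "untimed"): ("B", "component assembly model"),
    ("approximately timed", "approximately timed"): ("C", "bus arbitration model"),
    ("approximately timed", "cycle-timed"): ("D", "bus functional model"),
    ("cycle-timed", "approximately timed"): ("E", "cycle-accurate computation model"),
    ("cycle-timed", "cycle-timed"): ("F", "implementation model"),
}


def excluded_combinations():
    """Return the (computation, communication) pairs that have no node."""
    return [(comp, comm) for comp in LEVELS for comm in LEVELS
            if (comp, comm) not in MODELS]


def is_refinement_path(nodes):
    """True if timing detail never decreases along the given node sequence."""
    by_node = {node: key for key, (node, _name) in MODELS.items()}
    rank = LEVELS.index
    for first, second in zip(nodes, nodes[1:]):
        (comp1, comm1), (comp2, comm2) = by_node[first], by_node[second]
        if rank(comp2) < rank(comp1) or rank(comm2) < rank(comm1):
            return False
    return True
```

With this table, a sequence such as A, B, D, F only ever adds timing detail, whereas going from D to E would lower the precision of the communication model.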


2.7.4 SpecC 


The SpecC language [173] provides us with a nice example for demonstrating TLMs 
and a clear separation between communication and computation. SpecC models 
systems as hierarchical networks of behaviors communicating through channels. 
SpecC descriptions consist of behaviors, channels, and interfaces. Behaviors include 
ports, locally instantiated components, private variables and functions, and a public 
main function. Channels encapsulate communication. They include variables and 
functions, which are used for the definition of a communication protocol. Interfaces
link behaviors and channels together. They declare the communication
protocols which are defined in a channel. SpecC can model hierarchies with nested
behaviors.


Example 2.29 Figure 2.66 [173] shows a component G including sub-components
g1 and g2 as leaves in the hierarchy. The channel can be changed without changing
the interfaces or components.

This structural hierarchy is described in the following SpecC model:


01: interface L { void Write(int x); };
02: interface R { int Read(void); };
03: channel H implements L,R {
04:   int Data; bool Valid;
05:   void Write(int x) { Data=x; Valid=true; }
06:   int Read(void) {
07:     while (!Valid) waitfor(10);
08:     return (Data);
09:   }
10: };
11: behavior G1(in int p1, L p2, out int p3) {
12:   void main(void) { /*...*/ p2.Write(p1); } };
13: behavior G2(in int p1, R p2, out int p3) {
14:   void main(void) { /*...*/ p3=p2.Read(); } };
15: behavior G(in int p1, out int p2) {
16:   int h1; H h2; G1 g1(p1, h2, h1); G2 g2(h1, h2, p2);
17:   void main(void) {
18:     par { g1.main(); g2.main(); }
19:   }
20: };


Concurrent execution of sub-components is denoted by the keyword par in line 
18. As indicated in line 16, sub-components are communicating through integer h1 
and through channel h2. Note that the interface protocol implemented in channel 
H (see line 03), consisting of methods for read and write operations (lines 05 
and 06), can be changed without changing behaviors G1 and G2. For example, 
communication can be bit-serial or parallel, and the choice does not affect the 
models of G1 and G2. This is a necessary feature for reuse of hardware components 
or intellectual property (IP). The presented SpecC model does not include any 
timing information. Hence, it is a specification model (model of type A in Fig. 2.65). 

∇
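
The point that the channel can be exchanged without touching the behaviors can be mimicked in an ordinary programming language. The following Python sketch is an analogy, not SpecC: PollingChannel loosely mirrors channel H (a data word plus a valid flag that the reader polls), QueueChannel is a hypothetical alternative protocol, and g1, g2, and run_g are stand-ins for the behaviors G1, G2, and G. Threads are used here in place of the par statement, which is an assumption of this sketch.

```python
import queue
import threading
import time


class PollingChannel:
    """Loosely mirrors channel H: data word plus a valid flag, polled by the reader."""

    def __init__(self):
        self._data = None
        self._valid = threading.Event()

    def write(self, x):
        self._data = x
        self._valid.set()

    def read(self):
        while not self._valid.is_set():
            time.sleep(0.001)  # stands in for waitfor(10)
        return self._data


class QueueChannel:
    """Hypothetical alternative protocol offering the same write/read interface."""

    def __init__(self):
        self._queue = queue.Queue(maxsize=1)

    def write(self, x):
        self._queue.put(x)

    def read(self):
        return self._queue.get()


def g1(p1, channel):
    """Sender behavior: only knows the write side of the interface."""
    channel.write(p1)


def g2(channel, result):
    """Receiver behavior: only knows the read side of the interface."""
    result.append(channel.read())


def run_g(p1, channel):
    """Run both behaviors concurrently and return the value received by g2."""
    result = []
    threads = [threading.Thread(target=g1, args=(p1, channel)),
               threading.Thread(target=g2, args=(channel, result))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result[0]
```

Calling run_g(42, PollingChannel()) and run_g(42, QueueChannel()) yields the same value although g1 and g2 are unchanged; only the object passed as the channel differs.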


The design flow for SpecC was already shown in Fig. 1.9 on p. 23. The path in
Fig. 2.65 is A, B, D, F [83]. At the specification level, SpecC can model any kind of
communication and typically uses message passing. The communication model of
SpecC has inspired the communication model in SystemC 2.0.

Note that SpecC is based on C and C++ syntax. The reason for this is the 
following: There is the trend of implementing more and more functionality in 
software and using C for this purpose. For example, embedded systems implement 
standards such as MPEG 1/2/4 or decoders for mobile phone standards such as 
GSM, UMTS, or LTE. These standards are frequently available in the form of 
“reference implementations,” consisting of C programs not optimized for speed 
but providing the required functionality. The disadvantage of design methodologies 
based on special hardware description languages (like VHDL or Verilog, see below) 
is that these standards must be rewritten in order to generate systems. Further- 
more, simulating hardware and software together requires interfacing software and 
hardware simulators. Typically, this involves a loss of simulation efficiency and 
inconsistent user interfaces. Also, designers would need to learn several languages. 

Therefore, there has been a search for techniques for representing hardware 
structures in software languages. Some fundamental problems had to be solved 
before hardware could be modeled with software languages: 


• Concurrency, as it is found in hardware, has to be modeled in software.

• There has to be a representation of simulated time.

• Multiple-valued logic as described earlier must be supported.

• Almost all useful hardware circuits should simulate deterministically.


For the SpecC language, as well as for other hardware description languages, 
these problems were solved. 


2.7.5 SystemC 


TLM and the separation between communication and computation are
also available in SystemC. SystemC (like SpecC) is based on C and C++. Similar
to SpecC, SystemC provides channels, ports, and interfaces as abstract components
for communication. The introduction of these mechanisms facilitates TLM.
SystemC™ [243, 521] is a C++ class library. With SystemC, specifications can 
be written in C or C++, making appropriate references to the class library. SystemC 
comprises a notion of processes executed concurrently. Their execution is controlled 
by calls to wait primitives and sensitivity lists (lists of signals for which value 
changes start a re-execution of code). The sensitivity list concept includes dynamic 
sensitivity lists, i.e., the list of relevant signals can change during the execution. 
SystemC includes a model of time. The earlier SystemC 1.0 used floating-point
numbers to denote time. In the current standard, an integer model of time is preferred.
SystemC also supports physical units such as nanoseconds and microseconds.
SystemC data types include all common hardware types: four-valued logic ('0',
'1', 'X' and 'Z') and bitvectors of different lengths are supported. Writing digital
signal processing applications is simplified due to available fixed-point data types.
Determinate behavior (see p. 58) of SystemC is not guaranteed in general, unless
a certain modeling style is used. Using a command line option, the simulator can 
be directed to run processes in different orders. This way, the user can check if 
the simulation results depend on the sequence in which the processes are executed. 
However, for models of realistic complexity, only the presence of non-determinate 
behavior can be shown, not its absence. 
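
The idea of rerunning a model with different process orders can be sketched in a few lines of Python. This is not the SystemC simulator or its command line option; run, distinct_results, proc_inc, and proc_double are hypothetical names, and the "processes" are plain functions that update a shared variable immediately.

```python
from itertools import permutations


def run(processes, order):
    """Execute the processes once each, in the given order, on fresh state."""
    state = {"x": 0}
    for index in order:
        processes[index](state)
    return state["x"]


def proc_inc(state):
    """Process that increments the shared variable."""
    state["x"] = state["x"] + 1


def proc_double(state):
    """Process that doubles the shared variable."""
    state["x"] = state["x"] * 2


def distinct_results(processes):
    """Collect the final values obtained over all execution orders."""
    orders = permutations(range(len(processes)))
    return {run(processes, order) for order in orders}
```

If distinct_results returns more than one value, the model is order dependent. The sketch tries every permutation, which grows factorially with the number of processes; for realistic models only a few orders can be tried, which is why such runs can expose non-determinate behavior but cannot prove that there is none.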

Transaction-level modeling with SystemC has been described by Montoreano 
[401]. The paper distinguishes only between two types of TLM models: 


• Loosely timed models are described as follows [401]: “These models have a
loose dependency between timing and data, and are able to provide timing 
information and the requested data at the point when a transaction is being 
initiated. These models do not depend on the advancement of time to be able 
to produce a response. Normally, resource contention and arbitration are not 
modeled using this style. Due to the limited dependencies and minimal context 
switches, these models can be made to run the fastest and are particularly useful 
for doing software development on a Virtual Platform.” 

• Approximately timed models are described as follows [401]: “These models
can depend on internal/external events firing and/or time advancing before they 
can provide a response. Resource contention and arbitration can be modeled 
easily with this style. Since these models must synchronize/order the transactions 
before processing them, they are forced to trigger multiple context switches in the 
simulation, resulting in performance penalties.” 


Hardware synthesis starting from SystemC has become available [215, 216]. 
A synthesizable subset of the language has been defined [8]. There are also 
commercial synthesis offerings. Commercial offerings are expected to support the 
synthesizable subset as a minimum. Methodology and applications for SystemC- 
based design are described in a book on that topic [407]. At the time of writing, the 
most recent version of SystemC is SystemC 2.3.1 [7]. 


2.7.6 VHDL 
Introduction 


VHDL is another HDL which is based on the discrete event paradigm. Unfor- 
tunately, it does not support a clear distinction between communication and 
computation, and reusing components is more difficult. However, VHDL is sup- 
ported by many industrial and academic tools and is in widespread use. Having 
presented an initial example of event-based modeling already on p. 87, we would 
like to delve deeper into VHDL. 

VHDL uses processes for modeling concurrency. Each process models one component
of the potentially concurrent hardware. For simple hardware components,
a single process may be sufficient. More complex components may need several
processes for modeling their operations. Processes communicate through signals.
Signals roughly correspond to physical connections (wires).

The origin of VHDL can be traced back to the 1980s. At that time, most design 
systems used graphical HDLs. The most common building block was the gate. 
However, in addition to using graphical HDLs, we can also use textual HDLs. The 
strength of textual languages is that they can easily represent complex computations 
including variables, loops, function parameters, and recursion. Accordingly, when 
digital systems became more complex in the 1980s, textual HDLs almost completely 
replaced graphical HDLs. Textual HDLs were initially a research topic at universi- 
ties. See Mermet et al. [392] for a survey of languages designed in Europe at that 
time. MIMOLA was one of these languages, and the author of this book contributed 
to its design and applications [373, 377]. Textual languages became popular when 
VHDL and its competitor Verilog (see p. 109) were introduced. 

VHDL was designed in the context of the VHSIC program of the Department of 
Defense (DoD) in the USA. VHSIC stands for very high speed integrated circuits. 
Initially, the design of VHDL (VHSIC hardware description language) was done 
by three companies: IBM, Intermetrics, and Texas Instruments. A first version of 
VHDL was published in 1984. Later, VHDL became an IEEE standard, called IEEE 
1076. The first IEEE version was standardized in 1987; updates were published in 
1993, in 2000, in 2002, and in 2008 [237, 239-242]. VHDL-AMS [245] allows 
modeling analog and mixed-signal systems by including differential equations in 
the language. The design of VHDL used Ada (see p. 111) as the starting point, since 
both languages were designed for the DoD. Since Ada is based on PASCAL, VHDL 
has some of the syntactical flavor of PASCAL. However, the syntax of VHDL is 
much more complex, and it is necessary not to get distracted by the syntax. In the 
current book, we will just focus on some concepts of VHDL which are useful also 
in other languages. A full description of VHDL is beyond the scope of this book. 
The standard is available from IEEE (see, e.g., [242]). 


Entities and Architectures 


VHDL, like all other HDLs, includes support for modeling concurrent operation 
of hardware components. Components are modeled by so-called design entities or 
VHDL entities. Entities contain processes used to model concurrency. According 
to the VHDL grammar, design entities are composed of two types of ingredients: an 
entity declaration and one or several architectures (see Fig. 2.67). 

For each entity, the most recently analyzed architecture will be used by default. 
The use of other architectures can be specified. Architectures may contain several 
processes. 


Fig. 2.67 Entity consisting of an entity declaration and architectures


Example 2.30 We will discuss a full adder as an example. Full adders have three
input ports and two output ports (see Fig. 2.68).

Fig. 2.68 Full adder and its interface signals (inputs a, b, carry_in; outputs sum, carry_out)

An entity declaration corresponding to Fig. 2.68 is the following:

entity full_adder is                  -- entity declaration
  port (a, b, carry_in: in BIT;       -- input ports
        sum, carry_out: out BIT);     -- output ports
end full_adder;

Two hyphens (--) start a comment, which extends until the end of the line. ∇


Architectures consist of architecture headers and architectural bodies. We can
distinguish between different styles of bodies, in particular between structural and
behavioral bodies. We will show how the two are different using the full adder as
an example. Behavioral bodies include just enough information to compute output
signals from input signals and the local state (if any), including the timing behavior
of the outputs.


Example 2.31 The following is an example of this: 


architecture behavior of full_adder is    -- architecture
begin
  sum       <= (a xor b) xor carry_in after 10 ns;
  carry_out <= (a and b) or (a and carry_in) or
               (b and carry_in) after 10 ns;
end behavior;


VHDL-based simulators can display output signal waveforms resulting from 
stimuli applied to the inputs of the full adder described above. 
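
The logic expressed by the two signal assignments can be checked independently of any VHDL tool. The following Python sketch evaluates the same two Boolean expressions on integers 0 and 1 and ignores the 10 ns delay; the function names full_adder and truth_table are chosen for this illustration only.

```python
from itertools import product


def full_adder(a, b, carry_in):
    """Same Boolean expressions as in the behavioral body, without timing."""
    total = (a ^ b) ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return total, carry_out


def truth_table():
    """List ((a, b, carry_in), (sum, carry_out)) for all eight input combinations."""
    return [((a, b, c), full_adder(a, b, c))
            for a, b, c in product((0, 1), repeat=3)]
```

For every row of the table, sum + 2 * carry_out equals a + b + carry_in, which is the defining property of a full adder.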

In contrast, structural bodies describe the way entities are composed of simpler 
entities. For example, the full adder can be modeled as an entity consisting of three 
components (see Fig. 2.69). These components are called i1 to i3 and are of type 
half_adder or or_gate. 

In the 1987 version of VHDL, these components must be declared in a so-called
component declaration. This declaration is very similar to (and serves the
same purpose as) forward declarations in other languages. It provides
the necessary information about the component even if the full description of that
component is not yet stored in the VHDL database (this may happen in the case
of so-called top-down designs). From the 1993 version of VHDL onward, such
declarations are not required if the relevant components are already stored in the
component database.

Fig. 2.69 Schematic describing structural body of the full adder

Connections between local component and entity ports are described in port 
maps. The following VHDL code represents the structural body of Fig. 2.69: 


architecture structure of full_adder is       -- architecture head
  component half_adder
    port (in1, in2: in BIT; carry: out BIT; sum: out BIT);
  end component;
  component or_gate
    port (in1, in2: in BIT; o: out BIT);
  end component;
  signal x, y, z: BIT;                        -- local signals
begin                                         -- port map section
  i1: half_adder                              -- introduction of half_adder i1
      port map (a, b, x, y);                  -- connections between ports
  i2: half_adder port map (y, carry_in, z, sum);   -- connections
  i3: or_gate    port map (x, z, carry_out);       -- connections
end structure;
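
The wiring given by the three port maps can be traced with ordinary function calls. In the following Python sketch the internals of half_adder (AND for the carry, XOR for the sum) and of or_gate are assumptions, since the text only declares these components; the connections x, y, and z follow the port maps positionally.

```python
from itertools import product


def half_adder(in1, in2):
    """Assumed half adder: returns (carry, sum), in the port order of the component."""
    return in1 & in2, in1 ^ in2


def or_gate(in1, in2):
    """Assumed two-input OR gate."""
    return in1 | in2


def full_adder_structural(a, b, carry_in):
    """Compose the full adder exactly as wired in the structural body."""
    x, y = half_adder(a, b)           # i1: port map (a, b, x, y)
    z, total = half_adder(y, carry_in)  # i2: port map (y, carry_in, z, sum)
    carry_out = or_gate(x, z)         # i3: port map (x, z, carry_out)
    return total, carry_out


def matches_arithmetic():
    """Check the composition against integer addition for all inputs."""
    return all(a + b + c == s + 2 * co
               for a, b, c in product((0, 1), repeat=3)
               for s, co in [full_adder_structural(a, b, c)])
```

Under these assumptions about the components, the structural composition computes the same function as the behavioral body.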


Assignments 


Example 2.31 contains several assignments. Assignments are special cases of 
statements. In VHDL, there are two kinds of assignments: 


• Variable assignments: The syntax of variable assignments is


variable := expression 


Whenever control reaches such an assignment, the expression is computed and
assigned to the variable. Such assignments behave like assignments in common
programming languages.

• Signal assignments: Signal assignments (as mentioned already on pages 88 and
100) are evaluated concurrently. Signals and signal assignments are introduced in
an attempt to model electrical signals in real hardware systems. Signals associate
values with instants in time. In VHDL, such a mapping from time to values is
represented by waveforms. Waveforms are computed from signal assignments.
The syntax of signal assignments is


signal <= expression;
signal <= transport expression after delay;
signal <= expression after delay;
signal <= reject time inertial expression after delay;


Whenever control reaches such an assignment, the expression is computed and 
used to extend predicted future values of the waveform. In VHDL, each signal 
is associated with a so-called signal driver. Computing the value resulting from 
the contributions of multiple drivers to the same signal is called resolution, and 
resulting values are computed by functions called resolution functions. In this 
way, the sup function mentioned in the context of CSA theory is implemented if 
signals are connected. 

In order to compute future values, simulators are assumed to include a 
queue of events to happen later than the current simulated time. This queue 
is sorted by the time at which future events (e.g., updates of signals) should 
happen. Executing a signal assignment results in the creation of entries in this 
queue. Each entry contains a time for executing the event, the affected signal, 
and the value to be assigned. For signal assignments not containing any after 
clause (first syntactical form), the entry will contain the current simulation time 
as the time at which this assignment has to be performed. In this case, the change 
will take place after an infinitesimally small amount of time, called δ-delay (see
below). This allows us to update signals without changing macroscopic time. 

For signal assignments containing a transport prefix (second syntactical 
form), the update of the signal will be delayed by the specified amount. This 
form of the assignment is following the so-called transport delay model. This 
model is based on the behavior of simple wires: wires are (as a first order 
of approximation) delaying signals. Even short pulses propagate along wires. 
The transport delay model can be used for logic circuits, even though its main 
application is to model wires. 


Example 2.32 Suppose that we model a simple OR-gate using a transport delay 
signal assignment: 


c <= transport a or b after 10 ns; 


Such a model would propagate even short pulses (see Fig. 2.70).
Output signal c includes a short pulse of 5 ns, which would be suppressed for
an inertial delay model. ∇

Transport delay signal assignments will delete all entries in the queue
corresponding to the time of the computed update or later times (if we first
execute an assignment with a rather large delay and then execute an assignment
with a smaller delay, then the entry resulting from the first assignment will be
deleted).

For signal assignments containing an after clause, but no transport clause, 
inertial delay is assumed. The inertial delay model reflects the fact that 
real circuits come with some “inertia.” This means that short spikes will be 
suppressed. For the third syntactical form of the signal assignment, all signal 
changes which are shorter than the specified delay are suppressed. For the fourth 
form, all signal changes which are shorter than the indicated amount are removed 
from the predicted waveform. The subtle rules for removals are not repeated here. 


Example 2.33 Suppose that we model a simple OR-gate using inertial delay: 


c <= a or b after 10 ns; 


For such a model, short spikes would be suppressed (see Fig. 2.71). 
There is no short pulse of 5 ns at c, but the 15 ns pulse arrives at the output. 
∇
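
The difference between the two delay models of Examples 2.32 and 2.33 can be reproduced with a small queue-based sketch in Python. Several things here are assumptions: the function name simulate_gate, the stimulus (a 5 ns pulse followed by a 15 ns pulse, since Figs. 2.70 and 2.71 are not reproduced in this text), and the inertial rule, which is a simplified version for the third syntactical form only (rejection limit equal to the delay). The subtle removal rules of the standard are not covered.

```python
def simulate_gate(input_changes, delay, mode, initial=0):
    """Return the (time, value) changes seen on output c.

    input_changes: time-ordered (time, value) pairs for the value of 'a or b'.
    mode: "transport" or "inertial" (simplified, rejection limit = delay).
    """
    pending = []                      # projected output waveform, ascending in time
    output, current = [], initial

    def mature(now):
        """Apply all queued updates that are due at or before 'now'."""
        nonlocal current
        while pending and pending[0][0] <= now:
            time_due, value = pending.pop(0)
            if value != current:      # only value changes count as events
                current = value
                output.append((time_due, value))

    for now, value in input_changes:
        mature(now)
        due = now + delay
        # Entries scheduled at the new update time or later are deleted.
        pending[:] = [entry for entry in pending if entry[0] < due]
        if mode == "inertial":
            # Simplified pulse rejection: of the remaining entries, keep only
            # the trailing run that already carries the new value.
            keep = []
            for entry in reversed(pending):
                if entry[1] != value:
                    break
                keep.append(entry)
            pending[:] = list(reversed(keep))
        pending.append((due, value))
    mature(float("inf"))
    return output


# Assumed stimulus: pulse from 10 ns to 15 ns, then pulse from 30 ns to 45 ns.
STIMULUS = [(10, 1), (15, 0), (30, 1), (45, 0)]
```

With a delay of 10 ns, the transport variant reports changes at 20, 25, 40, and 55 ns, so both pulses reach the output. The inertial variant reports changes only at 40 and 55 ns: the 5 ns pulse is suppressed, and the 15 ns pulse arrives.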


VHDL Processes 


Assignments are just a shorthand for VHDL processes. More control over signal
evaluations is available with processes. The general syntax for processes is as
follows:

label:               -- optional
process
  declarations       -- optional
begin
  statements         -- optional
end process;


In addition to assignments, processes may contain wait statements. Such
statements can be used to explicitly suspend a process. There are the following
kinds of wait statements:


wait on signal list;    -- suspend until one of the signals changes
wait until condition;   -- suspend until condition is met
wait for duration;      -- suspend for specified interval
wait;                   -- suspend process indefinitely


As an alternative to explicit wait statements, a list of signals can be added to the 
process header. In that case, the process is activated whenever one of the signals in 
that list changes its value. 


Example 2.34 The following model of an AND-gate will execute its body once and 
will restart from the beginning every time one of the inputs changes its value: 


process(x,y) begin
  prod <= x and y;
end process;


This model is equivalent to


process begin
  prod <= x and y;
  wait on x,y;
end process;


where there is an explicit wait statement at the end. ∇


The VHDL Simulation Cycle 


According to the original standards document [237], the execution of a VHDL 
model is described as follows: 

“The execution of a model consists of an initialization phase followed by the 
repetitive execution of process statements in the description of that model. Each 
such repetition is said to be a simulation cycle. In each cycle, the values of all 
signals in the description are computed. If as a result of this computation an event 
occurs on a given signal, process statements that are sensitive to that signal will 
resume and will be executed as part of the simulation cycle.”

The initialization phase takes signal initializations into account and executes each
process once. It is described in the standards as follows:²⁰

“At the beginning of initialization, the current time, Tc, is assumed to be 0 ns. The
initialization phase consists of the following steps:²¹


• The driving value and the effective value of each explicitly declared signal are
computed, and the current value of the signal is set to the effective value. This
value is assumed to have been the value of the signal for an infinite length of time
prior to the start of the simulation. ...

• Each ... process in the model is executed until it suspends. ...

• The time of the next simulation cycle (which in this case is the first simulation
cycle), Tn, is calculated according to ... step (e) of the simulation cycle, below.”


Each simulation cycle starts with setting the current time to the next time at
which changes must be considered. This time, Tn, was either computed during
the initialization or during the last execution of the simulation cycle. Simulation
terminates when the current time reaches its maximum, TIME'HIGH. The
standard describes the simulation cycle as follows:

“A simulation cycle consists of the following steps: 


(a) The current time, Tc, is set equal to Tn. Simulation is complete when Tn =
TIME'HIGH and there are no active drivers or process resumptions at Tn.

(b) Each active explicit signal in the model is updated. (Events may occur as a 
result.)” ... 

In the cycle preceding the current cycle, future values for some signals have 
been computed. If Tc corresponds to the time at which these values become
valid, they are now assigned. Values of newly computed signals are not assigned 
before the next simulation cycle, at the earliest. Signals that change their value 
generate events which, in turn, may release processes that are sensitive to that 
signal. 

(c) “For each process P, if P is currently sensitive to a signal S and if an event 
has occurred on S in this simulation cycle, then P resumes. 

(d) Each ...process that has resumed in the current simulation cycle is executed 
until it suspends. 

(e) Tn (the time of the next simulation cycle) is set to the earliest of


1. TIME'HIGH (this is the end of simulation time).

2. The next time at which a driver becomes active (this is the next instant in
time, at which a driver specifies a new value), or

3. The next time at which a process resumes (as computed from wait for
statements).


If Tn = Tc, then the next simulation cycle (if any) will be a delta cycle.”


²⁰ We leave out the discussion of implicitly declared signals and so-called postponed processes.
²¹ Some sections of the standard are omitted in the citation (indicated by “...”).

Fig. 2.72 VHDL simulation

The iterative nature of simulation cycles is shown in Fig. 2.72.
Delta (δ) simulation cycles have been the source of discussions. They introduce
an infinitesimally small delay if the user did not specify any.

Fig. 2.73 RS flipflop


Example 2.35 Let us come back to our latch example and look more closely at 
timing. Figure 2.73 shows the latch again, this time using standard schematic 
symbols. 


The flipflop is modeled in VHDL as follows: 


entity RS_Flipflop is
  port (R:  in BIT;        -- reset
        S:  in BIT;        -- set
        Q:  inout BIT;     -- output
        nQ: inout BIT);    -- Q-bar
end RS_Flipflop;

architecture one of RS_Flipflop is
begin
  process (R,S,Q,nQ)
  begin
    Q <= R nor nQ; nQ <= S nor Q;
  end process;
end one;


Ports Q and nQ must be of mode inout since they are also read internally, which
would not be possible if they were of mode out. Table 2.5 shows the simulation
times at which signals are updated for this model. During each cycle, updates are
propagated through one of the gates. Simulation terminates after three δ cycles. The
last cycle does not change anything, since Q is already '0'. ∇
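
The sequence of δ cycles for this model can be replayed with a short Python sketch. The starting values are an assumption, because the rows of Table 2.5 are not reproduced in this text: the flipflop is taken to be set (Q = 1, nQ = 0) when R changes from 0 to 1 at 0 ns while S stays 0. The names nor and rs_flipflop_delta_cycles are hypothetical, and max_cycles is a guard added for this sketch only.

```python
def nor(a, b):
    """Two-input NOR on integers 0 and 1."""
    return int(not (a or b))


def rs_flipflop_delta_cycles(signals, max_cycles=100):
    """Replay delta cycles of the process; return one signal snapshot per cycle.

    signals: dict with the values of R, S, Q, nQ at the time the process
    is first triggered.
    """
    trace = [dict(signals)]
    for _ in range(max_cycles):
        # Process execution: new values are computed from current values only.
        scheduled = {"Q": nor(signals["R"], signals["nQ"]),
                     "nQ": nor(signals["S"], signals["Q"])}
        changed = [name for name, value in scheduled.items()
                   if signals[name] != value]
        # Next delta cycle: the scheduled values are assigned.
        signals = {**signals, **scheduled}
        trace.append(dict(signals))
        if not changed:
            break                     # no event, so the process is not resumed
    return trace


# Assumed situation at 0 ns: R has just become 1, flipflop was set before.
START = {"R": 1, "S": 0, "Q": 1, "nQ": 0}
```

Under these assumptions, the (Q, nQ) pairs at 0 ns, 0 ns + δ, 0 ns + 2δ, and 0 ns + 3δ are (1, 0), (0, 0), (0, 1), and (0, 1): Q falls first, nQ rises one δ cycle later, and the third δ cycle changes nothing. Starting instead from R = S = Q = nQ = 0, this zero-delay model keeps toggling both outputs and the loop only stops at max_cycles.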


δ cycles correspond to an infinitesimally small unit of time, which will always
exist in reality. δ cycles ensure that simulation respects causality.

Table 2.5 δ cycles for RS flipflop (columns: <0 ns, 0 ns, 0 ns + δ, 0 ns + 2δ, 0 ns + 3δ)

The results do not depend on the order in which parts of the model are executed
by the simulation. This feature is enabled by the separation between the computation 
of new values for signals and their actual assignment. In a model containing the lines


a <= b;
b <= a;

signals a and b will always be swapped. If the assignments were performed
immediately, the result would depend on the order in which we execute the
assignments (see also p. 57). VHDL models are therefore determinate. This is
what we expect from the simulation of a real circuit with a fixed behavior.
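
The effect of separating the computation of new values from their assignment can be shown with two small Python functions. Both names, run_deferred and run_immediate, are made up for this sketch; assignments are given as (target, source) pairs.

```python
def run_deferred(signals, assignments):
    """Evaluate all right-hand sides on current values, then assign (signal style)."""
    new_values = {target: signals[source] for target, source in assignments}
    return {**signals, **new_values}


def run_immediate(signals, assignments):
    """Assign each value as soon as it is computed (variable style)."""
    signals = dict(signals)
    for target, source in assignments:
        signals[target] = signals[source]
    return signals


SWAP = [("a", "b"), ("b", "a")]
```

Starting from a = 0 and b = 1, run_deferred yields a = 1 and b = 0 for either order of the two assignments. run_immediate yields a = b = 1 for the order shown and a = b = 0 for the reversed order, so its result depends on the execution order.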

There can be arbitrarily many δ cycles before the current time Tc is advanced.
This possibility of infinite loops can be confusing. One way of avoiding this
possibility is to disallow zero delays, which we used in our model of the flipflop.

The propagation of values using signals also allows an easy implementation of 
the observer pattern (see p. 35). In contrast to SDF, the number of observers can 
vary, depending on the number of processes waiting for changes on a signal. 

What is the communication model behind VHDL? The description of the seman- 
tics of VHDL relies heavily on a single, centralized queue of future events, storing 
values of all signals in the future. The purpose of this queue is not to implement 
asynchronous message passing. Rather, this queue is supposed to be accessed by 
the simulation kernel, one entry at a time, in a non-distributed fashion. Attempts
to perform distributed VHDL simulations typically suffer from poor
performance. All modeled components can access values of signals and variables
which are in their scope without any message-based communication. Therefore, we 
tend toward associating VHDL with a shared memory-based implementation of the 
communication. However, FIFO-based message passing could be implemented in 
VHDL on top of the VHDL simulator as well. 


IEEE 1164 


In VHDL, there is no predefined number of signal values, except for some basic 
support for two-valued logic. Instead, the used value sets can be defined in VHDL 
itself, and different VHDL models can use different value sets. 

However, portability of models would suffer in a very severe manner if this
capability of VHDL was applied in this way. In order to simplify exchanging
VHDL models, a standard value set was defined and standardized by the IEEE.
This standard is called IEEE 1164 and is employed in many system models. IEEE
1164 has nine values: {'0', '1', 'L', 'H', 'X', 'W', 'Z', 'U', '-'}. The first seven
values correspond to the seven signal values described in Sect. 2.7.2. 'U' denotes an
uninitialized value. It is used by simulators for signals that have not been explicitly
initialized.

'-' denotes the input don’t care. This value needs some explanation. Fre- 
quently, hardware description languages are used for describing Boolean functions. 
The VHDL select statement is a very convenient means for doing that. The select 
statement corresponds to switch and case statements found in other languages, and 
its meaning is different from the select statement in Ada (see p. 112). 


Example 2.36 Suppose that we would like to represent the Boolean function:

$f(a, b, c) = a\bar{b} + bc$

Furthermore, suppose that f should be undefined for the case of a = b = c = '0'.
A very convenient way of specifying this function would be the following:


f <= select a&b&c       -- & denotes concatenation
  '1' when '10-'        -- corresponds to first term
  '1' when '-11'        -- corresponds to second term
  'X' when '000'


This way, functions given above could be easily translated into VHDL. 
Unfortunately, the select statement denotes something completely different. Since IEEE 
1164 is just one of a large number of possible value sets, it does not include any 
knowledge about the “meaning” of '-'. Whenever VHDL tools evaluate select 
statements such as the one above, they check if the selecting expression (a&b&c 
in the case above) is equal to the values in the when clauses. In particular, they 
check if, e.g., a&b&c is equal to '10-'. In this context, '-' behaves like any other 
value: VHDL systems check if c has a value of '-'. Since '-' is never assigned to 
any of the variables, these tests will never be true. ∇
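
The difference between the literal comparison performed by VHDL tools and the intended don't-care matching can be sketched in C. This is only a minimal illustration under simplifying assumptions: signal vectors are represented as plain strings, and the function names match_literal and match_dont_care are hypothetical and not part of any VHDL tool.

```c
#include <stdbool.h>
#include <string.h>

/* Literal comparison: '-' is treated like any other value,
   which is what happens when a select statement is evaluated. */
static bool match_literal(const char *value, const char *pattern)
{
    return strcmp(value, pattern) == 0;
}

/* Intended semantics: '-' in the pattern matches any value
   at the same position. */
static bool match_dont_care(const char *value, const char *pattern)
{
    if (strlen(value) != strlen(pattern))
        return false;
    for (size_t i = 0; pattern[i] != '\0'; i++) {
        if (pattern[i] != '-' && pattern[i] != value[i])
            return false;
    }
    return true;
}
```

With a&b&c equal to "101", match_literal("101", "10-") is false, whereas match_dont_care("101", "10-") is true. The first function mirrors the behavior described above; the second mirrors what the designer had in mind.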

Therefore, '-' is of limited benefit. The non-availability of convenient input 
don’t care values is the price that one has to pay for the flexibility of defining value 
sets in VHDL itself.²² 

The nice property of the general discussion on pages 89 to 93 is the following: 
it allows us to immediately draw conclusions about the modeling power of IEEE 
1164. The IEEE standard is based on the seven-valued value set described on p. 91 
and, therefore, is capable of modeling circuits containing depletion transistors. It is, 
however, not capable of modeling charge storage.²³ 

²² This problem was corrected in VHDL 2006 [341]. 


2.7.7 Verilog and SystemVerilog 


Verilog [539] is another hardware description language. Initially it was a proprietary 
language, but it was later standardized as IEEE standard 1364, with versions 
called IEEE standard 1364-1995 (Verilog version 1.0) and IEEE standard 
1364-2001 (Verilog 2.0). Some features of Verilog are quite similar to VHDL. Just 
like in VHDL, designs are described as a set of connected design entities, and 
design entities can be described behaviorally. Also, processes are used to model 
concurrency of hardware components. Just like in VHDL, bitvectors and time units 
are supported. There are, however, some areas in which Verilog is less flexible and 
focuses more on comfortable built-in features. For example, standard Verilog does 
not include the flexible mechanisms for defining enumerated types such as the ones 
defined in the IEEE 1164 standard. However, support for four-valued logic is built 
into the Verilog language, and the standard IEEE 1364 also provides multiple-valued 
logic with eight different signal strengths. Multiple-valued logic is more tightly 
integrated into Verilog than into VHDL. The Verilog logic system also provides 
more features for transistor-level descriptions. However, VHDL is more flexible. 
For example, VHDL allows hardware entities to be instantiated in loops. This can 
be used to generate a structural description for, e.g., n-bit adders without having to 
specify n adders and their interconnections manually. 

Verilog has a similar number of users as VHDL. While VHDL is more popular 
in Europe, Verilog is more popular in the USA. 

Verilog versions 3.0 and 3.1 are also known as SystemVerilog. They include 
numerous extensions to Verilog 2.0. These extensions include [244, 517]: 


• additional language elements for modeling behavior, 
• C data types such as int and type definition facilities such as typedef and 
  struct, 
• definition of interfaces of hardware components as separate entities, 
• standardized mechanism for calling C/C++ functions and, to some extent, to call 
  built-in Verilog functions from C, 
• significantly enhanced features for describing an environment (called test bench) 
  for the hardware circuit under design (called CUD), and for using the test bench 
  to validate the CUD by simulation, 
• classes known from object-oriented programming for use within test benches, 
• dynamic process creation, 
• standardized interprocess communication and synchronization, including 
  semaphores, 
• automatic memory allocation and deallocation, 
• language features that provide a standardized interface to formal verification (see 
  p. 239). 

²³ As an exception, if the capability of modeling depletion transistors or pull-up resistors is not 
needed, one could interpret weak values as stored charges. This is, however, not very practical 
since pull-up resistors are found in most actual systems. 


Due to the capability of interfacing with C and C++, interfacing to SystemC 
models is also possible. Improved facilities for simulation-based and formal 
verification-based design validation and the possible interfacing to SystemC have 
led to good acceptance of SystemVerilog. 


2.8 von Neumann Languages 


The sequential execution and explicit control flow of von Neumann languages 
are their common characteristics. Also, such languages allow an almost unre- 
stricted access to global variables, and we may need explicit communication and 
synchronization. Model-based design using CFSMs and computational graphs is 
very appropriate for embedded system design. Nevertheless, the use of standard 
von Neumann languages is still widespread. Therefore, we cannot ignore these 
languages. Also, the distinction between models like KPNs and properly restricted 
von Neumann languages is blurring. For KPNs, we do also have sequential 
execution of the code for each of the nodes. We are still keeping the distinction 
between KPN and von Neumann languages since the KPN style of modeling has its 
advantages like determinate execution. 

For the first two languages covered next, communication is built into the 
languages. For the remaining languages, focus is on the computations, and com- 
munication can be replaced by selecting different libraries. 


2.8.1 CSP 


CSP (communicating sequential processes) [217] is one of the first languages 
comprising mechanisms for interprocess communication. Communication is based 
on channels. 


Example 2.37 Consider input/output for channel c in this example: 


process A                       process B 
  var a                           var b 
  a := 3;                         ... 
  c!a; -- output to channel c     c?b; -- input from channel c 
end;                            end; 

Both processes will wait for the other process to arrive at the input or output 
statement. This is a case of rendezvous-based, blocking, or synchronous message 
passing. ∇
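
The rendezvous behavior of c!a and c?b can be approximated in C with POSIX threads. The following is a minimal sketch, not an implementation of CSP: the type channel_t and the functions chan_init, chan_send, chan_recv, and process_a are hypothetical names, and the channel is assumed to connect exactly one sender and one receiver.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical rendezvous channel for one sender and one receiver. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int             value;
    int             full;    /* 1 while a value is waiting to be taken */
} channel_t;

static void chan_init(channel_t *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->changed, NULL);
    c->full = 0;
}

/* Corresponds to c!a: returns only after the receiver has taken the value. */
static void chan_send(channel_t *c, int v)
{
    pthread_mutex_lock(&c->lock);
    c->value = v;
    c->full = 1;
    pthread_cond_broadcast(&c->changed);
    while (c->full)
        pthread_cond_wait(&c->changed, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Corresponds to c?b: blocks until the sender has arrived. */
static int chan_recv(channel_t *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->full)
        pthread_cond_wait(&c->changed, &c->lock);
    int v = c->value;
    c->full = 0;
    pthread_cond_broadcast(&c->changed);
    pthread_mutex_unlock(&c->lock);
    return v;
}

/* Process A of the example: a := 3; c!a */
static void *process_a(void *arg)
{
    chan_send((channel_t *)arg, 3);
    return NULL;
}
```

Process B would call chan_recv on the same channel from another thread. Whichever side arrives first waits for the other, which is the synchronous message passing described above.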


CSP is determinate, since it relies on the commitment to wait for input from 
a particular channel, like in Kahn process networks. CSP has laid the foundation 
for the OCCAM language that was proposed as a programming language of the 
transputer [435]. The focus on communication channels has been picked up again 
in the design of the XS1 processor [603]. 


2.8.2 Ada 


During the 1980s, the Department of Defense (DoD) in the USA realized that the 
dependability and maintainability of the software in its military equipment could 
soon become a major source of problems, unless some strict policy was enforced. 
It was decided that all software should be written in the same real-time language. 
Requirements for such a language were formulated. 

No existing language met the requirements, and, consequently, the design of 
a new one was started. The language which was finally accepted was based on 
PASCAL. It was called Ada (after Ada Lovelace, regarded as being the first (female) 
programmer). Ada’95 [80, 287] is an object-oriented extension of the original 
standard. 

One of the interesting features of Ada is the ability to have nested declarations of 
processes (called tasks in Ada). Tasks are started whenever control passes into the 
scope in which they are declared. 


Example 2.38 The following code has been adopted from Burns et al. [80]: 


procedure example1 is 
  task a; 
  task b; 
  task body a is 
    -- local declarations for a 
  begin 
    -- statements for a 
  end a; 
  task body b is 
    -- local declarations for b 
  begin 
    -- statements for b 
  end b; 
begin 
  -- body of procedure example1 
end; 


Tasks a and b will start before the first statement of the code of example1. ∇

The communication concept of Ada is another key concept. It is based on the 
synchronous rendezvous paradigm. Whenever two tasks want to exchange information, 
the task reaching the “meeting point” first has to wait until its partner has also 
reached a corresponding point of control. Syntactically, procedures are used for 
describing communication. Procedures which can be called from other tasks must 
be identified by the keyword entry. 


Example 2.39 This code has also been adopted from Burns et al. [79]: 


task screen_out is 
  entry call (val: character; x, y: integer); 
end screen_out; 


Task screen_out includes a procedure named call which can be called from 
other processes. Some other task can call this procedure by prefixing it with the 
name of the task: 


screen_out.call('Z', 10, 20); 


The calling task has to wait until the called task has reached a point of control, 
at which it accepts calls from other tasks. This point of control is indicated by the 
keyword accept: 


task body screen_out is 
begin 
  accept call (val: character; x, y: integer) do 
  end call; 
end screen_out; 


Obviously, task screen_out may be waiting for several calls at the same time. 
The Ada select statement provides this capability: 


task screen_out is 
  entry call_ch(val: character; x, y: integer); 
  entry call_int(z, x, y: integer); 
end screen_out; 

task body screen_out is 
  ... 
  select 
    accept call_ch ... do ... 
    end call_ch; 
  or 
    accept call_int ... do ... 
    end call_int; 
  end select; 


In this case, task screen_out will be waiting until either call_ch or call_int 
is called. ∇


Due to the presence of the select statement, Ada is not determinate. Ada has been 
the preferred language for military equipment produced in the Western hemisphere 
for some time. Information about Ada is available from a number of web sites (see, 
e.g., [288]). 


2.8.3 Communication Libraries 


Standard von Neumann languages do not come with built-in communication primi- 
tives. However, communication can be provided by libraries. There is a trend toward 
supporting communication within some local system as well as communication over 
longer distances. The use of Internet protocols is becoming more popular. 


MPI 


Multi-core programming with imperative programs is possible with the message 
passing interface MPI. MPI is a very frequently used library, initially designed 
for high-performance computing. It allows a choice between synchronous and 
asynchronous message passing. For example, blocking message passing is 
possible with the MPI_Send library function [395]; it behaves synchronously 
whenever the MPI implementation does not buffer the message: 

MPI_Send(buffer, count, type, dest, tag, comm) where: 


• buffer is the address of data to be sent, 
• count is the number of data elements to be sent, 
• type is the data type of data to be sent (e.g., MPI_CHAR, MPI_SHORT, MPI_INT), 
• dest is the process id of the target process, 
• tag is a message id (for sorting incoming messages), 
• comm is the communication context (set of processes for which destination field 
  is valid), and 
• function result indicates success. 


The following is an asynchronous library function: 
MPI_Isend(buffer, count, type, dest, tag, comm, request) where 


• buffer, count, type, dest, tag, comm are the same as above, and 
• request receives a unique “request number” issued by the system. The programmer 
  uses this system-assigned “handle” later (in a WAIT type routine) to determine 
  completion of the non-blocking operation. 
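
The two calls can be put side by side in a small C program. This is a hedged sketch rather than a recommended communication pattern: it assumes an installed MPI implementation and a launch with at least two processes (for example with mpiexec -n 2), and the array contents and tag values are arbitrary choices for illustration.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int data[4] = {1, 2, 3, 4};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Blocking send: returns once 'data' may be reused. */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Non-blocking send: returns immediately with a request handle. */
        MPI_Request request;
        MPI_Isend(data, 4, MPI_INT, 1, 1, MPI_COMM_WORLD, &request);
        /* ... other work that does not modify 'data' ... */
        MPI_Wait(&request, MPI_STATUS_IGNORE);   /* the WAIT type routine */
    } else if (rank == 1) {
        int received[4];
        MPI_Recv(received, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(received, 4, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n",
               received[0], received[1], received[2], received[3]);
    }

    MPI_Finalize();
    return 0;
}
```

The buffer passed to MPI_Isend must not be modified until MPI_Wait has returned, which is why the handle is needed.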


For MPI, the partitioning of computations among various processors must be 
done explicitly, and the same is true for the communication and the distribution of 
data. Synchronization is implied by communication, but explicit synchronization is 
also possible. As a result, much of the management code is explicit and causes a 
major amount of work for the programmer. Also, such code does not scale well when the 
number of processors is significantly changed [554]. 

In order to apply the MPI style of communication to real-time systems, a real- 
time version of MPI, called MPI/RT, has been defined [501]. MPI/RT does not cover 
issues such as thread creation and termination. MPI/RT is conceived as a potential 
layer between the operating system and standard (non-real-time) MPI. 

MPI is available on a variety of platforms and is also considered for multiple 
processors on a chip. However, it is based on the assumption that memory accesses 
are faster than communication operations. Also, MPI mainly targets 
homogeneous multiprocessors. These assumptions are not true for multiple processors on 
a chip. 

MPI has recently been extended to cover shared memory-based communication 
as well. 


OpenMP 


OpenMP is a compiler-based solution for shared memory-based communication. 
For OpenMP, parallelism is mostly explicit, whereas computation partitioning, 
communication, synchronization, etc. are implicit. Parallelism is expressed with 
pragmas: for example, loops can be preceded by pragmas indicating that they should 
be parallelized. 


Example 2.40 The following program demonstrates a small parallel loop [439]: 


void a1(int n, float *a, float *b) { 
    int i; 
#pragma omp parallel for 
    for (i=1; i<n; i++) /* i is private by default */ 
        b[i] = (a[i] + a[i-1]) / 2.0; 
} 

Note that a simple pragma is sufficient to indicate parallel programming. ∇


This means that OpenMP requires relatively little parallelization effort from the 
user. However, this also means that the user cannot control partitioning 
[554]. There are some applications for MPSoCs (see, e.g., Marian et al. [366]). 
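
As a further hedged illustration of how little annotation is needed, the following sketch sums an array using a reduction clause. The function name sum_array is hypothetical. When the code is compiled without OpenMP support, the pragma is ignored and the loop runs sequentially with the same result.

```c
/* Hypothetical helper: sums n integers. The pragma asks OpenMP to split the
   loop across threads and to combine the per-thread partial sums. How the
   iterations are partitioned is left to the OpenMP implementation. */
long sum_array(const int *a, int n)
{
    long sum = 0;
    int i;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Communication of the partial sums and the final synchronization are implicit, in line with the characterization of OpenMP given above.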

More techniques for multi-core programming will be described in the section on 
system software (see p. 232). 


2.8.4 Additional Languages 


The Java language was not designed for embedded systems. There have been 
attempts to solve some of the resulting problems [12, 270]. However, Android and 
Java for smart cards are the only major applications of Java in small systems. At 
the time of writing, the JTRES workshop on “Java Technologies for Real-time and 
Embedded Systems” (see http://jtres2016.compute.dtu.dk/) reflects the latest state 
of the art in using Java for such systems. 

Pearl [127] was designed for industrial control applications. It includes a 
large repertoire of language elements for controlling processes and referring to time. 
It requires an underlying real-time operating system. Pearl has been very popular in 
Europe, and a large number of industrial control projects have been implemented 
in Pearl. Pearl supports semaphores which can be used to protect communication 
based on shared buffers. 

Chill [592] was designed for telephone exchange stations. It was standardized by 
the CCITT and used in telecommunication equipment. Chill is a kind of extended 
PASCAL. 

IEC 60848 [231] and STEP 7 [488] are specialized languages that are used in 
control applications. Both provide graphical elements for describing the system 
functionality. 


2.9 Levels of Hardware Modeling 


In practice, designers start design cycles at various levels of abstraction. In some 
cases, these are high levels describing the overall behavior of the system to be 
designed. In other cases, the design process starts with the specification of electrical 
circuits at lower levels of abstraction. For each of the levels, a variety of languages 
exists, and some languages cover various levels. In the following, we will describe a 
set of possible levels. Some lower-end levels are presented here for context reasons. 
Specifications should not start at those levels. The following is a list of frequently 
used names and attributes of levels: 


• System-level models: The term system level is not clearly defined. It is used here 
to denote the entire embedded system and the system into which information 
processing is embedded (“the product”) and possibly also the environment (the 
physical input to the system, reflecting, e.g., the roads and weather conditions). 
Obviously, such models include mechanical as well as information processing 
aspects, and it may be difficult to find appropriate simulators. Possible solutions 
include VHDL-AMS (the analog extension to VHDL), Verilog-AMS, SystemC, 
Modelica, COMSOL (see https://www.comsol.com/), or MATLAB/Simulink. 
MATLAB/Simulink and VHDL-AMS support modeling partial differential equa- 
tions, which is a key requirement for modeling mechanical systems. It is a 
challenge to model information processing parts of the system in such a way 
that the simulation model can also be used for the synthesis of the embedded 
system. If this is not possible, error-prone manual translations between different 
models may be needed. 
• Algorithmic level: At this level, we are simulating the algorithms that we intend 
to use within the embedded system. For example, we might be simulating MPEG 
video encoding algorithms in order to evaluate the resulting video quality. For 
such simulations, no reference is made to processors or instruction sets. 

Data types may still allow a higher precision than the final implementation. 
For example, MPEG standards use double-precision floating-point numbers. The 
final embedded system will hardly include such data types. If data types have 
been selected such that every bit corresponds to exactly one bit in the final 
implementation, the model is said to be bit-true. Translating non-bit-true into 
bit-true models should be done with tool support (see p. 357). 

Models at this level may consist of single processes or of sets of cooperating 
processes. 

• Instruction set level: In this case, algorithms have already been compiled for 
the instruction set of the processor(s) to be used. Simulations at this level allow 
counting the executed number of instructions. There are several variations of the 
instruction set level: 


— In a coarse-grained model, only the effect of the instructions is simulated, 
and their timing is not considered. The information available in assembly 
reference manuals (instruction set architecture (ISA)) is sufficient for defining 
such models. 

— Transaction-level modeling: In transaction-level modeling (see also p. 93), 
transactions, such as bus reads and writes, and communication between 
different components are modeled. Transaction-level modeling includes fewer 
details than cycle-true modeling (see below), enabling significantly superior 
simulation speeds [105]. 

— In a more fine-grained model, we might have cycle-true instruction set 
simulation. In this case, the exact number of clock cycles required to run an 
application can be computed. Defining cycle-true models requires a detailed 
knowledge about processor hardware in order to correctly model, for example, 
pipeline stalls, resource hazards, and memory wait cycles. 


• Register-transfer level (RTL): At this level, we model all the components 
at the register-transfer level, including arithmetic/logic units (ALUs), registers, 
memories, multiplexers, and decoders. Models at this level are always cycle-true. 
Automatic synthesis from such models is not a major challenge. 

• Gate-level models: In this case, models contain gates as the basic components. 
Gate-level models provide accurate information about signal transition probabili- 
ties and can therefore also be used for power estimations. Also delay calculations 
can be more precise than for the RTL. However, typically no information about 
the length of wires and hence no information about capacitances is available. 
Hence, delay and power consumption calculations are still estimates. 

The term “gate-level model” is sometimes also employed in situations in 
which gates are only used to denote Boolean functions. Gates in such a model 
do not necessarily represent physical gates; we are only considering the behavior 
of the gates, not the fact that they also represent physical components. More 
precisely, such models should be called “Boolean function models,”²⁴ but this 
term is not frequently used. 

• Switch-level models: Switch-level models use switches (transistors) as their 
basic components. Switch-level models use digital value models (refer to p. 88 
for a description of possible value sets). In contrast to gate-level models, switch- 
level models are capable of reflecting bidirectional transfer of information. 
Switch-level models can be simulated with ternary simulation [72]. 
• Circuit-level models: Circuit theory and its components (current and voltage 
sources, resistors, capacitances, inductances, and frequently possible macro- 
models of semiconductors) form the basis of simulations at this level. Simu- 
lations involve partial differential equations. These equations are linear if and 
only if the behavior of semiconductors is linearized (approximated). The most 
frequently used simulator at this level is SPICE [557] and its variants. 

• Layout models: Layout models reflect the actual circuit layout. Such models 
include geometric information. Layout models cannot be simulated directly, 
since the geometric information does not directly provide information about 
the behavior. Behavior can be deduced by correlating the layout model with a 
behavioral description at a higher level or by extracting circuits from the layout, 
using knowledge about the representation of circuit components at the layout 
level. 

In a typical design flow, the length of wires and the corresponding capac- 
itances are extracted from the layout and back-annotated to descriptions at 
higher levels. This way, more precision can be gained for delay and power 
estimations. Also, layout information may be essential for thermal modeling. 
• Process and device models: At even lower levels, we can model fabrication 
processes. Using information from such models, we can compute parameters 
(gains, capacitances, etc.) for devices (transistors). Due to a growing complexity 
of the fabrication process, these models are also becoming more complex. 
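
The step from a non-bit-true to a bit-true model, mentioned for the algorithmic level, can be sketched in C as follows. The sketch assumes a hypothetical 16-bit target using the Q15 fixed-point format (value = raw/32768). The names mul_ref, q15_t, q15_from_double, and q15_mul are illustrative and do not come from a particular tool.

```c
#include <stdint.h>

/* Non-bit-true reference model: double precision. */
static double mul_ref(double x, double y)
{
    return x * y;
}

/* Bit-true model of the assumed target: every bit of a q15_t
   corresponds to one bit of the final implementation. */
typedef int16_t q15_t;

/* Converts to Q15 with saturation; the fraction is truncated. */
static q15_t q15_from_double(double x)
{
    double scaled = x * 32768.0;
    if (scaled > 32767.0)
        scaled = 32767.0;
    if (scaled < -32768.0)
        scaled = -32768.0;
    return (q15_t)scaled;
}

/* Q15 multiplication: exact 32-bit product, rescaled and saturated. */
static q15_t q15_mul(q15_t x, q15_t y)
{
    int32_t product = (int32_t)x * (int32_t)y;   /* Q30 result */
    int32_t result = product / 32768;            /* back to Q15 */
    if (result > 32767)
        result = 32767;                          /* only for (-1) * (-1) */
    return (q15_t)result;
}
```

Comparing mul_ref(x, y) with q15_mul(q15_from_double(x), q15_from_double(y)) / 32768.0 for representative inputs shows the precision lost by the bit-true data type, which is the kind of evaluation that tool support would automate.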


2.10 Comparison of Models of Computation 


2.10.1 Criteria 


Models of computation can be compared according to several criteria. For example, 
Stuijk [515] compares MoCs according to the following criteria: 


• Expressiveness and succinctness indicate which systems can be modeled and 
  how compact they are. 
• Analyzability relates to the availability of schedulability tests and scheduling 
  algorithms. Also, analyzability is affected by the need for run-time support. 
• The implementation efficiency is influenced by the required scheduling policy 
  and the code size. 

[Fig. 2.74 Comparison between data-flow models: Kahn process networks, SDF, and homogeneous SDF (HSDF), rated along the axes expressiveness and succinctness, analyzability, and implementation efficiency] 

²⁴ These models could be represented with binary decision diagrams (BDDs) [571]. 


Figure 2.74 classifies data-flow models according to these criteria. 

This figure reflects the fact that Kahn process networks are expressive: they 
are Turing-complete, meaning that any problem which can be computed on a 
Turing machine can also be computed in a KPN. Turing machines are used as 
the standard model of universal computers [214]. However, termination properties 
and upper bounds on buffer sizes of KPNs are difficult to analyze. While Kahn 
process networks are Turing-complete, cyclo-static data flow (CSDF, see p. 74) is 
not Turing-complete. Also, SDF graphs are not Turing-complete. The underlying 
reason is that they cannot model control flow. However, deadlock properties and 
upper bounds on buffer sizes of SDF graphs are easier to analyze. Homogeneous 
SDF (HSDF) graphs (graphs for which all rates are equal to one) are even less 
expressive but also easier to analyze. 
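
Why buffer bounds of SDF graphs are easy to analyze can be illustrated for the simplest case of a single edge between two actors. The following C sketch solves the so-called balance equation for that edge. The names repetitions_t, sdf_repetitions, and sdf_buffer_bound are hypothetical, and the buffer bound assumes one particular schedule (all source firings before the sink firings), so it is a sufficient size rather than a minimal one.

```c
/* Greatest common divisor (Euclid); both arguments must be positive. */
static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Firing counts of the two actors in one iteration of the graph. */
typedef struct {
    unsigned src_firings;
    unsigned dst_firings;
} repetitions_t;

/* Smallest positive solution of
       src_firings * produce == dst_firings * consume
   for an edge on which the source produces 'produce' tokens per firing
   and the sink consumes 'consume' tokens per firing. */
static repetitions_t sdf_repetitions(unsigned produce, unsigned consume)
{
    unsigned g = gcd(produce, consume);
    repetitions_t r = { consume / g, produce / g };
    return r;
}

/* Buffer size that suffices if all source firings of an iteration
   are scheduled before the sink firings. */
static unsigned sdf_buffer_bound(unsigned produce, unsigned consume)
{
    return sdf_repetitions(produce, consume).src_firings * produce;
}
```

For an edge with rates 2 and 3, the source fires three times and the sink twice per iteration, and six buffer entries suffice. For HSDF graphs, all rates are one, so every actor fires once per iteration. No comparable static computation exists for general KPNs.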

We could also compare MoCs with respect to the type of processes supported: 


• The number of processes can be either static or dynamic. A static number of 
  processes simplifies the implementation and is sufficient if each process models 
  a piece of hardware and if we do not consider “hot-plugging” (dynamically 
  changing the hardware architecture). Otherwise, dynamic process creation (and 
  termination) should be supported. 
• Processes can either be statically nested or all declared at the same level. For 
  example, StateCharts allows nested process declarations, while SDL (see p. 62) 
  does not. Nesting provides encapsulation of concerns. 
• Different techniques for process creation exist. Process creation can result from 
  an elaboration of the process declaration in the source code, through the fork 
  and join mechanism (supported for example in Unix) and also through explicit 
  process creation calls. 
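
The fork and join mechanism mentioned in the last item can be sketched with the POSIX calls fork and waitpid. The helper name fork_and_join is hypothetical, and the child performs no work other than terminating with the given exit code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: creates a child process (fork), lets it terminate
   with 'code', and waits for it (join). Returns the exit code of the
   child, or -1 on error. */
static int fork_and_join(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                 /* process creation failed */
    if (pid == 0)
        _exit(code);               /* child: would do its work here */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;                 /* join failed */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Since the number of calls to fork_and_join is decided at run time, this is an instance of dynamic process creation.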


The expressiveness of different data flow-oriented models of computation is also 
shown in Fig. 2.75 [42]. MoCs not discussed in this book are indicated by dashed 
lines. 

[Fig. 2.75 Expressiveness of data-flow models] 


Table 2.6 Language comparison 

Language    | Behavioral | Structural | Programming       | Exceptions | Dynamic process 
            | hierarchy  | hierarchy  | language elements | supported  | creation 
StateCharts | +          | -          | -                 | +          | - 
VHDL        | +          | +          | +                 | -          | - 
SDL         | +-         | +-         | +-                | -          | + 
Petri nets  | -          | -          | -                 | -          | + 
Java        | +          | -          | +                 | +          | + 
SpecC       | +          | +          | +                 | +          | + 
SystemC     | +          | +          | +                 | +          | + 
Ada         | +          | -          | +                 | +          | + 


None of the MoCs and languages presented so far meets all the requirements for 
specification languages for embedded systems. Table 2.6 presents an overview of 
some of the key properties of some of the languages. 

Interestingly, SpecC and SystemC meet all listed requirements. However, some 
other requirements (like a precise specification of deadlines, etc.) are not included. 
It is not very likely that a single MoC or language will ever meet all requirements, 
since some of the requirements are essentially conflicting. A language supporting 
hard real-time requirements may well be inconvenient to use for less strict real-time 
requirements. A language appropriate for distributed control-dominated applica- 
tions may be poor for local data-flow-dominated applications. Hence, we can expect 
that we will have to live with compromises and possibly with mixed models. 

Which compromises are actually used in practice? In practice, assembly lan- 
guage programming was very common in the early years of embedded systems 
programming. Programs were small enough to handle the complexity of problems 
in assembly languages. The next step was the use of C or derivatives of C. Due to 
the increasing complexity of embedded system software, higher-level languages are 
following the introduction of C. Object-oriented languages and SDL are languages 
which provide the next level of abstraction. Also, languages like UML are required 
to capture specifications at an early design stage. The trend is to move toward 
model-based designs [477]. In practice, languages can be used as shown in Fig. 2.76. 

[Fig. 2.76 Using various languages in combination] 

According to Fig. 2.76, languages like SDL or StateCharts can be translated into 
C. These C descriptions are then compiled. Starting with SDL or StateCharts also 
opens the way to implementing the functionality in hardware, if translators from 
these languages to VHDL are provided. Both C and VHDL will certainly survive as 
intermediate languages for many years. Java does not need intermediate steps but 
also benefits from good translation concepts to assembly languages. In a similar 
way, translations between various graphs are feasible. For example, SDF graphs can 
be translated into a subclass of Petri nets [515]. Also, they correspond to a subclass 
of the computation graph model proposed by Karp and Miller [282]. Linking the 
various models of computation is facilitated by formal techniques [95]. 

Several languages for embedded system design are covered in a book edited 
by M. Radetzki [464]. Popovici et al. [457] use a combination of Simulink and 
SystemC. 

We have skipped the discussion of algebraic languages like LOTOS [256] and 
Z [504]. These languages enable precise specifications and formal proofs, but they 
are not executable. 


2.10.2 Unified Modeling Language (UML) 


UML is a language including diagrams reflecting several MoCs. Table 2.7 
classifies the UML diagrams mentioned so far with respect to our table of MoCs. 

This table shows how UML covers several models of computation, with a 
focus on early design phases. Semantics of communication is typically imprecisely 
defined. Therefore, our classification cannot be precise in this respect. In addition to 
the diagrams already mentioned, the following diagrams can be modeled: 


• Deployment diagrams: These diagrams are important for embedded systems. 
  They describe the “execution architecture” of systems (hardware or software 
  nodes). 
• Package diagrams: Package diagrams represent the partitioning of software into 
  software packages. They are similar to module charts in StateMate. 
• Class diagrams: These diagrams describe inheritance relations of object classes. 
• Communication diagrams (called Collaboration diagrams in UML™ 1.x): 
  These graphs represent classes, relations between classes, and messages that are 
  exchanged between them. 
• Component diagrams: They represent the components used in applications or 
  systems. 
• Object diagrams, interaction overview diagrams, composite structure 
  diagrams: This list consists of three types of diagrams which are less frequently 
  used. Some of them may actually be special cases of other types of diagrams. 


Table 2.7 Models of computation available in UML 

Communication/             | Shared memory  | Message passing 
organization of components |                | Synchronous | Asynchronous 
Undefined components       | Use cases      | Sequence charts, timing diagrams 
Differential equations     | -              | 
Finite state machines      | State diagrams | -           | - 
Data flow                  | -              | Data-flow diagrams 
Petri nets                 | (Not useful)   | Activity charts 
Distributed event model    | -              | - 
von Neumann model          | -              | - 


Available tools provide some consistency checking between the different dia- 
gram types. Complete checking, however, seems to be impossible. One reason 
for this is that the semantics of UML initially was left undefined. It has been 
argued that this was done intentionally, since one does not like to bother about 
the precise semantics during the early phases of the design. As a consequence, 
precise, executable specifications can only be obtained if UML is combined with 
some other, executable language. Available design tools have combined UML with 
SDL [227] and C++. There are, however, also some first attempts to define the 
semantics of UML. 

Version 1.4 of UML was not designed for embedded systems. Therefore, it 
lacks a number of features required for modeling embedded systems (see p. 29). 
In particular, the following features are missing [386]: 


• the partitioning of software into tasks and processes cannot be modeled, 
• timing behavior cannot be described at all, 
• the presence of essential hardware components cannot be described. 


Due to the increasing amount of software in embedded systems, UML is gaining 
importance for embedded systems as well. Hence, several proposals for UML 
extensions to support real-time applications have been made [137, 386]. These 
extensions have been considered during the design of UML 2.0. UML 2.0 includes 
13 diagram types (up from nine in UML 1.4) [13]. Special profiles are taking 
the requirements of real-time systems into account [368]. Profiles include class 
diagrams with constraints, icons, diagram symbols, and some (partial) semantics. 
There are UML profiles for [368]: 


• Schedulability, Performance, and Time Specification (SPT) [429], 
• Testing [431], 
• Quality of Service (QoS) and Fault Tolerance [431], 
• a Systems Modeling Language called SysML [434], 
• Modeling and Analysis of Real-Time Embedded Systems (MARTE) [430], 
• UML and SystemC interoperability [469], 
• the SPRINT profile for reuse of intellectual property (IP) [505]. 


Using such profiles, we can, for example, attach timing information to 
sequence charts. However, profiles may be incompatible. Also, UML has been 
designed for modeling and frequently leaves too many semantic issues open to 
allow automatic synthesis of implementations [368]. 


2.10.3 Ptolemy II


The Ptolemy project [460] focuses on modeling, simulation, and design of hetero- 
geneous systems. Emphasis is on embedded systems that mix different technologies 
and, accordingly, also MoCs. For example, analog and digital electronics, hardware 
and software, and electrical and mechanical devices can be described. Ptolemy 
supports different types of applications, including signal processing, control appli- 
cations, sequential decision-making, and user interfaces. Special attention is paid 
to the generation of embedded software. The idea is to generate this software from 
the MoC which is most appropriate for a certain application. Version 2 of Ptolemy 
(Ptolemy II) supports the following MoCs and corresponding domains (see also 
p. 40): 


1. Communicating sequential processes (CSP).

2. Continuous time (CT): This model is appropriate for mechanical systems and
analog circuits. Hence, this model supports differential equations. Tools include
extensible differential equation solvers.

3. Discrete event model (DE): This is the model used by many simulators, e.g.,
VHDL simulators.

4. Distributed discrete events (DDE): Discrete event systems are difficult to
simulate in parallel, due to the inherent centralized queue of future events. Attempts
to distribute this data structure have not been very successful so far. Therefore,
this special (experimental) domain is introduced. Semantics can be defined such
that distributed simulation becomes more efficient than in the DE model.

5. Finite state machines (FSM).

6. Process networks (PN), using Kahn process networks (see p. 69).

7. Synchronous data flow (SDF).

8. Synchronous/reactive (SR) MoC: This model uses discrete time, but signals do
not need to have a value at every clock tick. Esterel (see p. 61) is a language
following this style of modeling.


This list shows the focus on different models of computation in the Ptolemy project. 


2.11 Problems 


We suggest solving the following problems at home or during a flipped classroom 
session: 


2.1 What is a (design) model? 


2.2 Prepare a list of up to six requirements for specification/modeling languages 
for embedded systems! 


2.3 Why could our specification lead to deadlocks? 

2.4 What is a “model of computation (MoC)”? 

2.5 What is a “job” and how is it different from “tasks”? 

2.6 Which are the two key techniques for communication in computers? 


2.7 Which description techniques can be used for capturing initial ideas about the 
system to be designed? 


2.8 Simulate trains between Paris, Brussels, Amsterdam, and Cologne, using the 
levi simulation software [498]! Modify the examples included with the software 
such that two independent tracks exist between any two stations and demonstrate 
an (arbitrary) schedule involving ten trains! 


2.9 Download the OpenModelica simulation software. Develop a simulation model 
for Newton’s cradle (see, e.g., https://en.wikipedia.org/wiki/Newton%27s_cradle). 


2.10 Modify the answering machine of Example 2.8 such that the owner can 
intervene at any time during the playing of prerecorded text or the recording of the
message. 


2.11 Model your daily schedule with a timed automaton. Hours are reflected by a 
variable h, days by a variable d. d = 1 means Monday, d = 7 means Sunday. On a 
weekend (d = 6 or d = 7), you leave the sleeping state between h = 10 and h = 11,
spend 1-2 h getting yourself ready for the day, stay with your friend until some time
in the range h = 20 to h = 21, and walk back home and enter the sleeping state
between h = 22 and h = 23. During the week (d = 1 or ... or d = 5), you leave
the sleeping state between h = 7 and h = 8, spend 1-2 h getting yourself ready for
the day, study until some time in the range h = 20 to h = 21, and walk back home 
and enter the sleeping state between h = 22 and h = 23. Model your schedule! Do 
not forget to increase the day d at the end of each day. 
Fig. 2.77 StateCharts example: left, graphical model; right, table of states


2.12 Suppose the StateCharts model in Fig. 2.77 (left) is given.

Also, suppose that we have the following sequence of input events: b c f h g h e a
b c. In the diagram in Fig. 2.77 (right), mark all the states the StateCharts model will
be in after a particular input has been applied! H denotes the history mechanism.


2.13 Are StateCharts determinate models if we follow the StateMate semantics? 
Please explain your answer! 


2.14 Is SDL a determinate language? Please explain your answer! 


2.15 Let us assume that you have been asked to help modeling the flow of visitors 
in the hypothetical Museum of Fine Future Information Nuggets (MUFFIN). We 
consider a steady state with no visitors entering or exiting the museum. The museum 
will have three exhibition halls. In front of each hall, there is space for a waiting line. 
The exit of this space is connected to the entry of the hall. Each of the hall exits is 
connected to each entry of the waiting spaces. Visitors leaving one of the halls are 
free to choose any of the other halls as their next one. We assume that each hall can
be described as a process in a meaningful way, with some randomness of the time
that a visitor stays in a hall. Assume that you would like to model this situation in
SDL. Show a diagram with explicit processes and FIFO queues!


2.16 Download the levi simulation software for KPNs [496], and develop a KPN 
model computing Fibonacci numbers in a distributed fashion (i.e., just using a single 
KPN node is illegal). 


2.17 Which three types of Petri nets did we discuss in this book? 


2.18 One of the types of Petri nets allows several non-distinguishable tokens per 
place. Which components are used in a mathematical model of such nets? Hint:
N = (P, ...)
2.19 Draw the following condition/event system: N = (C, E, F), given

• Conditions: C = {c1, c2, c3, c4},
• Events: E = {e1, e2, e3},
• Relation: F = {(c1, e1), (c1, e2), (e1, c2), (e1, c3), (e2, c2), (e2, c3), (e2, c4),
(c2, e3), (c3, e3), (c4, e3), (e3, c1), (e3, c4)}


Specify the precondition of e3 as well as the postcondition of e1. Is N simple or/and
pure? Given it is not, which edge(s) need(s) to be removed in order to turn N into a 
pure net? Substantiate or prove your answers concisely. 


2.20 What does a compact model of the dining philosopher’s problem look like? 


2.21 CSA theory leads to 2, 3, and 4 logic strengths, corresponding to 4, 7, and 10 
logic values. How many strengths and values are we using in IEEE 1164? Show the 
partial order among the values of IEEE 1164 in a diagram! Which of the values of 
IEEE 1164 are not included in the partial order, and what is the meaning of these?


2.22 Which of the following circuits can be modeled with IEEE 1164: comple- 
mentary CMOS outputs, outputs with a depletion transistor, open collector outputs, 
tristate outputs, or precharging on buses (if depletion transistors are used as well)? 


2.23 Suppose that a bus as shown in Fig. 2.78 is given. Rectangles containing an & 
sign denote AND-gates. Which of the IEEE 1164 values will be on the bus if both 
enable inputs are set to '0' (ena1 = ena2 = '0')? Which of the IEEE 1164 values
will be on the bus if ena1 = '0', ena2 = '1', and f2 = '1'?


2.24 Which of the following languages use asynchronous message passing: State- 
Charts, SDL, VHDL, CSP, Petri nets, or MPI? 


2.25 Which of the following languages use a broadcast mechanism for updating 
variables: StateCharts, SDL, or Petri nets? 


2.26 Which of the following diagram types are supported by UML: sequence 
charts, record charts, Y-charts, use cases, activity diagrams, or circuit diagrams? 


Fig. 2.78 Bus driven by tristate outputs
2.27 UML™ is a frequently used modeling technique. In the table below, enter
models of computation for the components in the left column and for communica- 
tion in the top row. Then enter as many UML diagram types as feasible into the 
remaining table cells. 


Table header: Communication / organization of components


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.


Chapter 3
Embedded System Hardware


In this chapter, we will present the interface between the physical environment and 
information processing (the cyphy-interface) together with the hardware required 
for processing, storing, and communicating information. Since we consider CPS,
covering the cyphy-interface is indispensable. The need to cover other hardware
components as well is a consequence of their impact on the performance, timing 
characteristics, power consumption, safety, and security. 

Regarding the cyphy-interface, we will present circuits for sampling and digi- 
tization of physical quantities as well as for the reverse process. We will present 
the sampling theorem and its impact. Regarding information processing, we will 
provide details of efficient hardware, in particular of digital signal processors, 
general-purpose computing on graphics processors, multi-core systems, and field 
programmable gate arrays (FPGAs). With respect to information storage, we will 
explain the memory hierarchy as it is used in embedded systems. We will also 
explain if and how existing communication technologies can be used. 

Electronic information processing requires electrical energy. Accordingly, this 
chapter includes a section on the generation (e.g., harvesting), storage, and efficient 
use of electrical energy in embedded systems, including battery and energy con- 
sumption models. This chapter closes with a survey on the challenges of supporting 
security in hardware. 


3.1 Introduction 


Frequently, hardware designs are reused, either in the form of real hardware 
components or in the form of intellectual property (IP). The reuse of available hard- 
and software components is at the heart of the platform-based design methodology 
(see also p. 296). This methodology is seen as a key method for mastering the 
growing complexity of embedded systems. Consistent with the need to consider
available hardware components and with the design information flow shown in
Fig. 3.1, we are now going to describe some of the essentials of embedded system
hardware.

Fig. 3.1 Simplified design information flow

Fig. 3.2 Hardware in the loop (blocks: sensors, sample-and-hold, A/D converter, information processing, display, D/A conversion / pulse-width modulation, actuators, energy source, energy storage, heat removal, (physical) environment)

Hardware for embedded systems is much less standardized than hardware for 
personal computers. Due to the huge variety of embedded system hardware, it 
is impossible to provide a comprehensive overview of all types of hardware 
components. Nevertheless, we will try to provide a survey of some of the essential 
components which can be found in most systems. In many cyber-physical systems, 
especially in control systems, hardware is used in a loop (see Fig. 3.2). We will use 
this loop to structure the presentation of components in this chapter. In this (con- 
trol) loop, information about the physical environment is made available through 
sensors. Typically, sensors generate continuous sequences of analog values. In this 
book, we will restrict ourselves to information processing where digital computers 
process discrete sequences of values. Appropriate conversions are performed by two 
kinds of circuits: sample-and-hold circuits and analog-to-digital converters (ADCs). 
After such conversion, information can be processed digitally. Generated results can 
be displayed and also be used to control the physical environment through actuators. 
Since many actuators are analog actuators, conversion from digital to analog signals 
may also be needed. We will see how this conversion can be achieved either 
by digital-to-analog converters (DACs) or indirectly by pulse-width modulation 
(PWM). 

Due to the prevailing electronic information processing, we assume that we 
require electrical energy. Some source of this energy must be available. If our energy 
source does not provide energy permanently, we may need to store energy, e.g., in 
rechargeable batteries or capacitors. During system operation, much of the electrical
energy will be converted into thermal energy (heat). It may be necessary to remove
thermal energy from the system.

Fig. 3.3 Acceleration sensor (courtesy S. Büttgenbach, IMT, TU Braunschweig), ©TU Braunschweig, Germany

This model is obviously appropriate for control applications. For other applica- 
tions, it can be employed as a first-order approximation. In the following, we will 
describe essential hardware components of embedded and cyber-physical systems 
following the structure of Fig. 3.2. 


3.2 Input: Interface Between Physical and Cyber-World 


3.2.1 Sensors 


Sensors are key components of the cyphy-interface. Sensors can be designed for 
virtually every physical quantity. There are sensors for weight, velocity, accelera- 
tion, electrical current, voltage, temperature, etc. A wide variety of physical effects 
can be exploited in the construction of sensors [151]. Examples include the law 
of induction (generation of voltages by a changing magnetic field) and photoelectric effects.
There are also sensors for chemical substances [152]. 

Recent years have seen the design of a huge range of sensors, and much of the 
progress in designing smart systems can be attributed to modern sensor technology. 
The availability of sensors has enabled the design of sensor networks (see, e.g., 
Tiwari et al. [543]), a key element of the Internet of Things. It is impossible to 
cover this subset of cyber-physical hardware technology comprehensively, and we 
can only give characteristic examples: 


• Acceleration sensors: Figure 3.3 shows a small sensor manufactured using
microsystem technology. The sensor contains a small mass in its center. When 
accelerated, the mass will be displaced from its standard position, thereby 
changing the resistance of the tiny wires connected to the mass. 

Acceleration sensors are included in the powerful inertial measurement units 
(IMUs) (see, e.g., Siciliano et al. [487], Section 20.4). They contain gyros and 
accelerometers, and they capture up to six degrees of freedom, comprising
position (x, y, and z) and orientation (roll, pitch, and yaw) [575]. They are 
contained in airplanes, cars, robots, and other products in order to provide inertial 
navigation. 

• Image sensors: There are essentially two kinds of image sensors: charge-coupled
devices (CCDs) and CMOS sensors. In both cases, arrays of light sensors are 
used. The architecture of CMOS sensor arrays is similar to that of standard 
memories: individual pixels can be randomly addressed and read out. CMOS 
sensors use standard CMOS technology for integrated circuits. Due to this, 
sensors and logic circuits can be integrated on the same chip. This allows 
some preprocessing to be done already on the sensor chip, leading to so-called 
smart sensors. CMOS sensors require only a single standard supply voltage and 
interfacing in general is easy. Therefore, CMOS-based sensors can be cheap. 

In contrast, CCD technology is optimized for optical applications. In CCD 
technology, charges must be transferred from one pixel to the next until they can 
finally be read out at an array boundary. This sequential charge transfer also gave 
CCDs their name. For CCD sensors, interfacing is more complex. 

Selecting the most appropriate image sensor depends on several constraints, 
which change as technology evolves. The image quality of CMOS sensors has 
been improved over the recent years, and the initial image superiority of CCDs 
became questionable. Therefore, achieving a good image quality is feasible with 
CCD and with CMOS sensors. Due to their faster readout speed, CMOS sensors 
are preferred for cameras with live view modes or video recording functionality 
[404]. Also, CMOS sensors are preferred for low-cost devices and if smart 
sensors are to be designed. Several application areas for CCDs have disappeared, 
but they are still used in areas such as scientific image acquisition. 

• Biometric sensors: Demands for higher security standards as well as the need
to protect mobile and removable equipment have led to an increased interest in
authentication. Due to the limitations of password-based security (e.g., stolen
and lost passwords), biometric sensors and biometric authentication receive
attention. Biometric authentication tries to identify whether or not a certain
person is actually the person she or he claims to be. Methods for biometric 
authentication include iris scans, fingerprint sensors, and face recognition. False 
accepts as well as false rejects are an inherent problem of biometric authenti- 
cation (see definitions on p. 257). In contrast to password-based authentication, 
exact matches are not possible. 

• Artificial eyes: Artificial eye projects have received significant attention. Some
projects have an impact on the eye, but others provide vision in an indirect way. 
For example, the Dobelle Institute experimented with a camera attached to a 
computer sending electrical pulses to a direct brain contact [532]. More recently, 
the less invasive translation of images into audio has been preferred. 

• Radio frequency identification (RFID): RFID technology is based on the
response of a tag to radio frequency signals [226]. The tag consists of an 
integrated circuit and an antenna, and it provides its identification to RFID 
readers. The maximum distance between tags and readers depends on the type 
of the tag. The technology is used to identify objects, animals, or people and is a
key enabler for the Internet of Things. 

• Automotive sensors: Today’s cars contain a large number of sensors. This
includes rain sensors, tire pressure sensors, collision sensors, etc. The overall 
goal is to provide comfort and safety to the passengers and the environment. 

• Other sensors: Other common sensors include thermal sensors, engine control
sensors, Hall effect sensors, and many more. 


Machine learning algorithms [188, 204, 453, 560] may need to be used to obtain 
meaningful information from noisy sensor readouts. 
Sensors generate signals. Mathematically, the following definition applies:


Definition 3.1 A signal σ is a mapping from a time domain D_T to a value domain
D_V:


    σ : D_T → D_V


Signals may be defined over a continuous or a discrete time domain as well as over 
a continuous or a discrete value domain. 


3.2.2 Discretization of Time: Sample-and-Hold Circuits 


All known digital computers work in a discrete time domain D_T. This means that
they can process discrete sequences or streams of values. Hence, incoming signals 
over the continuous time domain must be converted to signals over the discrete time 
domain. This is the purpose of sample-and-hold circuits. These are included in 
the cyphy-interface. Figure 3.4 (left) shows a simple sample-and-hold circuit. In 
essence, the circuit consists of a clocked transistor and a capacitor. The transistor 
operates like a switch. Each time the switch is closed by the clock signal, the 
capacitor is charged so that its voltage h(t) is practically the same as the incoming 
voltage e(t). After opening the switch again, this voltage will remain essentially 
unchanged until the switch is closed again. Each of the values stored on the capacitor 
can be considered as an element of a discrete sequence of values h(t), generated 
from a continuous function e(t) (see Fig. 3.4 (right)). If we sample e(t) at times 
{ts}, then h(t) will be defined only at those times. 

An ideal sample-and-hold circuit would be able to change the voltage at the 
capacitor in an arbitrarily short amount of time. This way, the input voltage at a 
particular instant in time could be transferred to the capacitor, and each element
in the discrete sequence would correspond to the input voltage at a particular point 
in time. In practice, however, the transistor has to be kept closed for a short time 
window in order to really charge or discharge the capacitor. The voltage stored on 
the capacitor will then correspond to a voltage reflecting that short time window. 
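
The behavior of an ideal sample-and-hold circuit can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the capacitor is charged instantaneously, sampling times are equidistant, and the function names `sample` and `hold` as well as the input signal are hypothetical choices, not taken from the circuit of Fig. 3.4.

```python
import math

def sample(e, t_start, period, count):
    """Ideal sampling: return pairs (t_s, h(t_s)) with h(t_s) = e(t_s)."""
    times = [t_start + s * period for s in range(count)]
    return [(t, e(t)) for t in times]

def hold(samples, t):
    """Value kept on the (ideal) capacitor at time t: the most recent sample
    taken at or before t, or None if no sample has been taken yet."""
    value = None
    for t_s, h in samples:
        if t_s <= t:
            value = h
    return value

# Hypothetical input: a sine wave with period 8, sampled at t = 0, 1, ..., 8.
samples = sample(lambda t: math.sin(2 * math.pi * t / 8), 0.0, 1.0, 9)
held = hold(samples, 2.5)   # still the value sampled at t = 2
```

Between two sampling times, `hold` returns the last sampled value, which mirrors the voltage remaining essentially unchanged while the switch is open.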
Fig. 3.4 Sample-and-hold phase: left, circuit; right, signals

Fig. 3.5 Approximation of a square wave by sine waves for K = 1 (left) and K = 3 (right)


3.2.3 Fourier Approximation of Signals 


Would we be able to reconstruct the original signal e(t) from the sampled signal 
h(t)? In order to answer this question, we revert to the fact that arbitrary signals can 
be approximated by summing (possibly phase-shifted) sine functions of different 
frequencies (Fourier approximation).¹


Example 3.1 A square wave can be approximated by Eq. (3.1) [440]: 


    e'_K(t) = \sum_{k=1,3,5,\ldots}^{K} \frac{4}{\pi k} \sin\left(\frac{2 \pi k t}{T}\right)    (3.1)


In this equation, T is the period and approximation is improved for increasing K. 
Figures 3.5 and 3.6 visualize Eq. (3.1). 


¹This presentation is based on the assumption that a comprehensive coverage of Fourier
approximations cannot be included in our course. Therefore, only the impact of these approximations is
demonstrated by examples. Knowing the theory behind these examples would be beneficial (see,
e.g., http://www.dspguide.com).
Fig. 3.6 Approximation of a square wave by sine waves for K = 7 (left) and K = 11 (right)


The larger difference between the square wave and its approximation at the
jump discontinuities of the square wave (best visible for K = 11) is called the Gibbs
phenomenon [440]. ∇
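
The effect of increasing K can also be checked numerically. The following sketch assumes that the approximation is the usual odd-harmonic sine series of a square wave with amplitude 1, i.e., terms 4/(πk) · sin(2πkt/T) for k = 1, 3, 5, ..., K. The function name and the chosen period are hypothetical.

```python
import math

def square_wave_approx(t, period, K):
    """Partial sum over the odd harmonics k = 1, 3, ..., K (assumed form of the series)."""
    return sum(4 / (math.pi * k) * math.sin(2 * math.pi * k * t / period)
               for k in range(1, K + 1, 2))

T = 8.0  # hypothetical period
# Evaluate in the middle of the positive half wave, where the square wave equals 1.
mid_values = {K: square_wave_approx(T / 4, T, K) for K in (1, 3, 7, 11)}
errors = {K: abs(v - 1.0) for K, v in mid_values.items()}
```

At t = T/4 the error shrinks as K grows. Close to the jumps, however, the overshoot of the Gibbs phenomenon does not vanish, so the improvement is not uniform over t.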


Definition 3.2 A signal transformation Tr is linear if for all signals e1(t) and e2(t)
we have


    Tr(e1 + e2) = Tr(e1) + Tr(e2)    (3.2)


Next, we restrict ourselves to linear systems. Then, in order to answer the question 
raised above, we study sampling each of the sine waves independently. 
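
As a small plausibility check, ideal sampling itself satisfies this linearity condition: sampling a sum of signals yields the sum of the sampled signals. The sketch below uses two hypothetical sine waves and a hypothetical helper `sample_at`; it is an illustration, not a proof.

```python
import math

def sample_at(signal, times):
    """Ideal sampling as a signal transformation: evaluate the signal at the given times."""
    return [signal(t) for t in times]

times = [0.5 * s for s in range(8)]                   # hypothetical sampling times
e1 = lambda t: math.sin(2 * math.pi * t / 8)          # hypothetical signal 1
e2 = lambda t: 0.5 * math.sin(2 * math.pi * t / 4)    # hypothetical signal 2

lhs = sample_at(lambda t: e1(t) + e2(t), times)       # Tr(e1 + e2)
rhs = [a + b for a, b in zip(sample_at(e1, times), sample_at(e2, times))]  # Tr(e1) + Tr(e2)
```

Because both sides agree, the sine waves making up a signal can be studied one at a time.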


Example 3.2 Consider signals described by either of the two functions e3 or e4: 


    e_3(t) = \sin\left(\frac{2\pi t}{8}\right) + 0.5 \sin\left(\frac{2\pi t}{4}\right)    (3.3)


    e_4(t) = \sin\left(\frac{2\pi t}{8}\right) + 0.5 \sin\left(\frac{2\pi t}{4}\right) + 0.5 \sin\left(\frac{2\pi t}{1}\right)    (3.4)


The sine waves used in these functions have periods of T = 8, 4, and 1, respectively 
(this can be seen by comparing these sine waves with those of Eq. (3.1)). A graphical 
representation of these functions is shown in Fig. 3.7. Suppose that we will be 
sampling these signals at integer times. It then so happens that both signals have the 
same value whenever they are sampled. Obviously, it is not possible to distinguish 
between e3(t) and e4(t) if we sample at these instants in time and if only the
sampled signal is available. ∇
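
This example can be reproduced numerically. The sketch assumes the sine waves with periods 8, 4, and 1 stated above, with amplitudes 1, 0.5, and 0.5, respectively. The range of sampling times is an arbitrary choice.

```python
import math

def e3(t):
    """Slow signal: sine waves with periods 8 and 4."""
    return math.sin(2 * math.pi * t / 8) + 0.5 * math.sin(2 * math.pi * t / 4)

def e4(t):
    """Same as e3 plus a fast sine wave with period 1."""
    return e3(t) + 0.5 * math.sin(2 * math.pi * t / 1)

# At integer times, the fast component is (numerically almost) zero.
max_diff_integer = max(abs(e3(t) - e4(t)) for t in range(0, 17))
# Between integer times, the two signals differ clearly.
diff_between = abs(e3(0.25) - e4(0.25))
```

Up to floating-point rounding, `max_diff_integer` is zero, whereas `diff_between` equals the amplitude 0.5 of the fast component.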


In general, sampled signals will not allow us to distinguish between some slow
signal e3(t) and some other faster varying signal e4(t) if e3(t) and e4(t) are identical
each time we are sampling the signals. The fact that two or more unsampled signals
can have the same sampled representation is called aliasing. We are not sampling
e4(t) frequently enough to notice, for example, that it has slope changes between
integer times. So, from this counterexample we can conclude that reconstruction
of the original unsampled signal is not feasible unless we have additional
knowledge about the frequencies or the waveforms present in the input signal.

Fig. 3.7 Visualization of e3(t) and e4(t)

How frequently do we have to sample signals to be able to distinguish between 
different sine waves? Let us assume that we are sampling the input signal at constant 
time intervals, such that T_s is the sampling period:


    \forall s: T_s = t_{s+1} - t_s    (3.5)

Let

    f_s = \frac{1}{T_s}    (3.6)


be the sampling rate or sampling frequency. Then, sampling theory provides us 
with the following theorem (see, e.g., [440]): 


Theorem 3.1 (Sampling Theorem) Given the above definitions of variables, 
aliasing is avoided if we restrict the frequencies of the incoming signal to less 
than half of the sampling frequency fs: 


    T_s < \frac{T_N}{2}, where T_N is the period of the “fastest” sine wave, or    (3.7)


    f_s > 2 f_N, where f_N is the frequency of the “fastest” sine wave    (3.8)


Definition 3.3 f_N is called the Nyquist frequency; f_s is the sampling rate.

The condition in Eq. (3.8) is called the sampling criterion, and sometimes the Nyquist
sampling criterion. 
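
The criterion can be applied directly to the two signals of Example 3.2. The sketch below is a minimal check. The helper names are hypothetical, and the periods (8 and 4 for e3, additionally 1 for e4) are those stated in the example.

```python
def avoids_aliasing(f_s, f_n):
    """Sampling criterion: the sampling rate must exceed twice the fastest frequency."""
    return f_s > 2 * f_n

def fastest_frequency(periods):
    """Frequency of the 'fastest' sine wave, given the periods of all components."""
    return 1.0 / min(periods)

f_s = 1.0  # one sample per time unit, i.e., sampling at integer times
ok_e3 = avoids_aliasing(f_s, fastest_frequency([8, 4]))      # 1 > 2 * 0.25 holds
ok_e4 = avoids_aliasing(f_s, fastest_frequency([8, 4, 1]))   # 1 > 2 * 1 does not hold
```

Sampling at integer times is therefore sufficient for e3(t) but not for e4(t), which is consistent with the aliasing observed in the example.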

Therefore, reconstruction of input signals e(t) from discrete samples h(t) can be 
successful only if we make sure that higher-frequency components such as the one 
in e4(t) are removed. This is the purpose of anti-aliasing filters. Anti-aliasing filters 
are placed in front of the sample-and-hold circuit (see Fig. 3.8). 

Figure 3.9 demonstrates the ratio between the amplitudes of the output and the 
input waves as a function of the frequency for this filter. Ideally, such a filter would 
remove all frequencies at and above half the sampling frequency and keep all other 
components unchanged. This way, it would convert signal e4(t) into signal e3 (t). 

In practice, such ideal filters (so-called brick-wall filters) do not exist.² Realizable
filters will already start attenuating frequencies smaller than f_s/2 and will
still not eliminate all frequencies larger than f_s/2 (see Fig. 3.9). Attenuated
high-frequency components will exist even after filtering. For frequencies smaller than
f_s/2, there may also be some “overshooting,” i.e., frequencies for which there is
some amplification of the input signal.
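
To illustrate why realizable filters differ from a brick-wall filter, the following sketch evaluates the amplitude ratio of a first-order RC low-pass filter. This filter type is an assumption chosen for its simple formula; it is not the filter shown in Fig. 3.9, and it does not exhibit overshooting. The sampling frequency and cutoff frequency are hypothetical values.

```python
import math

def rc_lowpass_gain(f, f_cutoff):
    """Amplitude ratio |output| / |input| of a first-order RC low-pass filter."""
    return 1.0 / math.sqrt(1.0 + (f / f_cutoff) ** 2)

f_s = 1.0        # hypothetical sampling frequency
f_c = f_s / 2    # cutoff placed at half the sampling frequency

gain_below = rc_lowpass_gain(0.25 * f_s, f_c)   # below f_s/2: already attenuated
gain_at = rc_lowpass_gain(0.5 * f_s, f_c)       # at f_s/2: about 0.707
gain_above = rc_lowpass_gain(1.0 * f_s, f_c)    # above f_s/2: still not eliminated
```

The gain decreases gradually instead of dropping to zero at f_s/2, so some high-frequency content survives the filter.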

The design of good anti-aliasing filters is an art by itself. This art has been 
studied, for example, in great detail for high-quality audio equipment, involving 
detailed hearing tests. Many of the perceived differences between high-quality 
equipment have been attributed to the design of such filters. 


3.2.4 Discretization of Values: Analog-to-Digital Converters 


Since we are restricting ourselves to digital computers, we must also replace signals
that map time to a continuous value domain D_V by signals that map time to a
discrete value domain D'_V. This conversion from analog to digital values is done by
analog-to-digital converters (ADCs). There is a large range of ADCs with varying
speed/precision characteristics. Typically, fast ADCs have a low precision and
high-precision converters are slow.

²This would require knowing the signal to be filtered for an infinite amount of time.

Fig. 3.10 Flash ADC: left, schematic; right, w as a function of h

We will present several converters in the next subsections. 


Flash ADC 


This type of ADC uses a large number of comparators. Each comparator has two
inputs, denoted as + and -. If the voltage at input + exceeds that at input -, the output
corresponds to a logical '1', and it corresponds to a logical '0' otherwise.³

In the ADC, all - inputs are connected to a voltage divider. If input voltage h(t)
exceeds (3/4) V_ref, the comparator at the top of Fig. 3.10 (left) will generate a '1'. The
encoder at the output of the comparators will try to identify the most significant '1'
and will encode this case as the largest output value. The case h(t) > V_ref should
normally be avoided since V_ref is typically close to the supply voltage of the circuit
and input voltages exceeding the supply voltage can lead to electrical problems. In
our case, input voltages larger than V_ref generate the largest digital value as long as
the converter does not fail due to the high input voltage.

Now, if input voltage h(t) is less than (3/4) V_ref, but still larger than (2/4) V_ref, the
comparator at the top of Fig. 3.10 will generate a '0', while the next comparator
will still signal a '1'. The encoder will encode this as the second largest value.

Similar arguments hold for cases (1/4) V_ref < h(t) < (2/4) V_ref and 0 < h(t) < (1/4) V_ref,
which will be encoded as the third largest and the smallest value, respectively.


³In practice, the case of equal voltages is not relevant, as the actual behavior for very small
differences between the voltages at the two inputs depends on many factors (like temperatures, 
manufacturing processes, etc.) anyway. 
Figure 3.10 (right) shows the relation between input voltages and generated digital
values. 

The outputs of the comparators encode numbers in a special way: if a certain
comparator output is equal to '1', then all the less significant outputs are also
equal to '1'. The encoder transforms this representation of numbers into the usual
representation of natural numbers. The encoder is actually a so-called priority
encoder, encoding the most significant input number carrying a '1' in binary.⁴
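
The interplay of comparators and priority encoder can be sketched as follows. The sketch assumes an ideal 2-bit converter whose voltage divider provides thresholds at 1/4, 2/4, and 3/4 of V_ref. The function name `flash_adc` and the test voltages are hypothetical, and a real encoder would be a combinational circuit rather than a loop.

```python
def flash_adc(h, v_ref, bits=2):
    """Return (comparator outputs, encoded value) for input voltage h.

    Comparator outputs are listed from least to most significant.
    """
    n = 2 ** bits                                       # number of distinguishable values
    thresholds = [v_ref * i / n for i in range(1, n)]   # n - 1 comparators
    outputs = [1 if h > th else 0 for th in thresholds]
    # Priority encoder: position of the most significant comparator signaling '1'.
    value = 0
    for position, out in enumerate(outputs, start=1):
        if out:
            value = position
    return outputs, value

low = flash_adc(0.1, 1.0)    # below V_ref/4: smallest value
mid = flash_adc(0.6, 1.0)    # between 2/4 and 3/4 of V_ref
high = flash_adc(0.8, 1.0)   # above 3/4 of V_ref: largest value
```

Note that whenever a comparator output is 1, all less significant outputs are 1 as well, which is the special number representation described above.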

The circuit can convert positive analog input voltages into digital values. 
Converting both positive and negative voltages and generating two’s complement 
numbers requires some extensions. 

One nice property of the flash ADC is the fact that it is automatically monotonic: 
For any increase in the analog voltage from 0 to the maximum, the corresponding 
digital value increases as well. This property is maintained even if the actual values
of the resistors deviate from their nominal values. However, such a deviation
would have an impact on the precision of the linear relation expected between 
analog and digital values. 

Unfortunately, the chain of resistors forms a conducting path, which exists even 
if the converter is not used. This could make it impossible to use this converter for 
low-power equipment. 

In general, ADCs are also characterized by their resolution. This term has several 
different but related meanings [15]. The resolution (measured in bits) is the number 
of bits produced by an ADC. For example, ADCs with a resolution of 16 bits are 
needed for many audio applications. However, the resolution is also measured in 
volts, and in this case it denotes the difference between two input voltages causing 
the output to be incremented by 1: 


    Q = \frac{V_{FSR}}{n}    (3.9)

where: Q is the resolution in volts per step,
       V_FSR is the difference between the largest and the smallest voltage, and
       n is the number of voltage intervals (not the number of bits).


Example 3.3 For the ADC of Fig. 3.10, the resolution is 2 bits or (1/4) V_ref volts, if we
assume V_ref as the largest voltage. ∇


The key advantage of the flash ADC is its speed. It does not need any clock. 
The delay between the input and the output is very small, and the circuit can be 
used easily, for example, for high-speed video applications. The disadvantage is its 
hardware complexity: we need n − 1 comparators in order to distinguish between 
n values. Imagine using this circuit in generating digital audio signals for CD 
recorders. We would need 2^16 − 1 comparators! High-resolution ADCs must be 
built differently. 

⁴ Such encoders are also useful for finding the most significant '1' in the mantissa of 
floating-point numbers. 


Successive Approximation 


Distinguishing between a large number of digital values is possible with ADCs 
using successive approximation. The circuit is shown in Fig. 3.11. 

The key idea of this circuit is to use binary search. Initially, the most significant 
output bit of the successive approximation register is set to '1'; all other bits are 
set to '0'. This digital value is then converted to an analog value, corresponding to 
0.5 · the maximum input voltage.⁵ If h(t) exceeds the generated analog value, the 
most significant bit is kept at '1'; otherwise it is reset to '0'. 

This process is repeated with the next bit. It will remain set to '1' if the input 
value is either within the second or the fourth quarter of the input value range. The 
same procedure is repeated for all the other bits. 

Figure 3.12 shows an example. Initially the most significant bit is set to '1'. 
This value is kept, since the resulting V₋ is less than h(t). Then, the second most 
significant bit is set to '1'. It is reset to '0', since the resulting V₋ exceeds h(t). 
Next, the third most significant bit is tried. It is set to '1', and this value is kept. 
Finally, the least significant bit is also set, and it remains set after the comparison has 
been completed. Obviously, h(t) must be constant during the conversion, otherwise 
the whole procedure would be jeopardized. This requirement is met if we employ a 
sample-and-hold circuit as shown above. The resulting digital signal is called w(t). 

⁵ Fortunately, the conversion from digital-to-analog values (D/A conversion) can be implemented 
very efficiently and can be very fast (see p. 180). 

Fig. 3.13 Pipelined ADC [291] 

The key advantage of the successive approximation technique is its hardware 
efficiency. In order to distinguish between n digital values, we need ⌈log2(n)⌉ bits 
in the successive approximation register and the D/A converter. The disadvantage is 
its speed, since it needs O(log2(n)) steps. These converters can therefore be used for 
high-resolution applications, where moderate speeds are required. Examples include 
audio applications. 
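
The binary search described above can be sketched in a few lines. This is a behavioral model only, not a description of the circuit in Fig. 3.11: the function name `sar_adc`, the ideal D/A conversion, and the choice to keep a bit when the input is at least as large as the generated value are assumptions made for illustration.

```python
def sar_adc(h, v_max, bits):
    """Binary search for the digital value of a constant input voltage h."""
    code = 0
    for i in reversed(range(bits)):            # most significant bit first
        trial = code | (1 << i)                # tentatively set bit i to '1'
        v_dac = v_max * trial / (1 << bits)    # ideal D/A conversion (assumption)
        if h >= v_dac:                         # keep the bit, otherwise reset it
            code = trial
    return code


example = sar_adc(0.7, 1.0, 4)   # bits are decided as keep, reset, keep, keep
```

For an input of 0.7 times the maximum voltage and 4 bits, the trial values are 0.5, 0.75, 0.625, and 0.6875 of the maximum, so the loop needs exactly one comparison per bit.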


Pipelined Converters 


These converters consist of a chain of converters, where each stage in the chain is 
in charge of converting a few bits (see Fig. 3.13). Each stage passes the remaining 
residue of the voltage to the next stage (if any). For example, each stage could 
convert a single bit and subtract the corresponding voltage. The resulting residue 
would typically be scaled up by a factor of two (in order to avoid too small voltages) 
and be passed on to the next stage. Typically, each stage would include a flash 
ADC of a few bits and a D/A converter to compute the voltage to be subtracted. 
Resulting digital values must be aligned in time. Required hardware resources 
increase linearly with the number of bits. With this structure, a good throughput 
can be achieved, but the latency is larger than for flash converters. 
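
A minimal sketch of the single-bit-per-stage variant mentioned above, assuming ideal components. The function name and the use of `v_ref / 2` as the comparison level of each stage are assumptions; the model also processes one sample at a time, whereas a real pipeline works on several samples concurrently, one per stage.

```python
def pipelined_adc(v_in, v_ref, stages):
    """Each stage resolves one bit, subtracts its voltage, and doubles the residue."""
    bits = []
    residue = v_in
    for _ in range(stages):
        bit = 1 if residue >= v_ref / 2 else 0       # 1 bit flash ADC (one comparator)
        residue = 2 * (residue - bit * v_ref / 2)    # subtract D/A output, scale by two
        bits.append(bit)
    return bits                                      # most significant bit first


example = pipelined_adc(0.7, 1.0, 4)
```

Because every stage has the same structure, adding one bit of resolution adds one stage, which matches the linear growth of hardware resources stated above.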


Other Converters 


Integrating converters use (at least) two phases for the measurement. During the 
first phase of length t_1, the integral of the input voltage over time is computed.⁶ 
For constant inputs, the resulting value V_out is proportional to the input voltage 
(V_out ∝ V_in · t_1). During the second phase, this value is decreased at a constant rate, 
and the time to reach a value of zero is counted. The final count is proportional 
to the input voltage. Hence, using proper scaling, the final count represents the 
input voltage. If the input voltage contains some noise, its impact is likely to 
be averaged out during the first integration phase. Hence, these converters are 
capable of compensating noise. They are typically found in slow, high-resolution 
multimeters. 

⁶ This can be done with a capacitor in the feedback loop of an operational amplifier (see p. 397). 

Fig. 3.14 Comparison of the speed/resolution characteristics of various ADCs [558] 
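
The two phases can be mimicked numerically. The sketch below is an idealized model with hypothetical parameter names (`dt` for the sampling step of the first phase, `count_dt` for the counter period of the second phase, `v_ref` for the constant discharge rate); none of these names come from the text.

```python
def dual_slope_adc(v_in_samples, dt, v_ref, count_dt):
    """Two-phase integrating conversion; returns the count of the second phase."""
    # Phase 1: integrate the (possibly noisy) input over t_1 = len(samples) * dt.
    v_out = sum(v * dt for v in v_in_samples)
    # Phase 2: decrease the integral at a constant rate and count until zero.
    count = 0
    while v_out > 0:
        v_out -= v_ref * count_dt
        count += 1
    return count


clean = dual_slope_adc([1.0] * 1000, 1e-3, 2.0, 1e-3)
noisy = dual_slope_adc([1.1, 0.9] * 500, 1e-3, 2.0, 1e-3)
```

Both calls yield a count of about 500, since the alternating disturbance of the second input cancels during the integration phase.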

For folding ADCs, the input voltage range is divided into 2^m segments [100, 
321]. A coarse-grained converter detects the segment of the current input voltage, 
yielding the m most significant output bits. A fine-grained converter computes the 
value within a segment, yielding the less significant output bits. 

For delta-sigma ADCs (ΔΣ ADCs), the name indicates that signal differences 
(Δs) are encoded and that they are summed up (Σ). A description of these converters 
is beyond the scope of this book. For details refer to Khorramabadi [292]. 


Comparison of ADCs 


Figure 3.14 provides an overview of the speed/resolution trade-offs of ADCs, using 
a trade-off analysis of Vogels et al. [558]. Flash ADCs are clearly the fastest but 
provide only a small resolution. Pipelining is frequently superior to successive 
approximation. Another overview of ADCs is provided by IEEE TV [437]. 

Figure 3.15 shows the behavior of a flash ADC when the input signal is that 
of Eq. (3.3). Only the behavior for a positive input signal is shown. The figure 
includes the voltage corresponding to the digital value, the original voltage, and 
the difference between the two. Obviously, the converter is “truncating” the digital 
representation of the analog signal to the number of available bits (i.e., the digital 
value is always less than or equal to the analog value). This is a consequence of 
the way in which the flash converter is doing comparisons. “Rounding” converters 
would need an internal correction by “half a bit.” Effectively, the digital signal 
encodes values corresponding to the sum of the original analog values and the 
difference w(t) − h(t). This means that it appears as if the difference between the 
two signals had been added to the original signal. This difference is a signal 
called quantization noise: 


Definition 3.4 Let h(t) be some analog signal. Let w(t) be derived from h(t) by 
quantization. The difference between the two is called quantization noise: 


\[ \text{quantization noise}(t) = w(t) - h(t) < Q \tag{3.10} \]


Increasing the resolution of the ADC decreases quantization noise. The impact of 
quantization noise is captured in the definition of the signal-to-noise ratio (SNR), 
measured in decibels (tenth of a bel, named after Alexander G. Bell). 
Definition 3.5 The SNR is defined as follows: 

\[ \text{SNR (in dB = decibels)} = 10 \cdot \log \frac{\text{power of the ``useful'' signal}}{\text{power of the noise signal}} \tag{3.11} \]

\[ = 20 \cdot \log \frac{\text{voltage of the ``useful'' signal}}{\text{voltage of the noise signal}} \tag{3.12} \]


Here we have used the fact that, for any given impedance R, the power of a signal is 
proportional to the square of the voltage. Decibels are not physical units, since the 
SNR is dimensionless. 

For any signal h(t), the power of the quantization noise is equal to α × Q, where 
α < 1 depends on the waveform of h(t). If h(t) can always be represented exactly 
by a digital value, then α = 0. If h(t) is always “just a little” below the next value 
that can be represented, α may be close to 1. 


Example 3.4 The SNR of 16 bit CD audio is (for α ≈ 1) about 20 × log(2^16) = 
96 dB. Values of α < 1 and imperfect ADCs change this number. ∇ 
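
The arithmetic of Example 3.4 can be checked with a short calculation. This is a sketch assuming a base-10 logarithm, a full-scale signal of 2^bits quantization steps, and a noise magnitude of α steps; the function name `snr_db` is hypothetical.

```python
import math


def snr_db(bits, alpha=1.0):
    """SNR in dB for a signal of 2**bits steps and a noise of alpha steps."""
    return 20 * math.log10((2 ** bits) / alpha)


cd_audio = snr_db(16)                      # roughly 96 dB
per_extra_bit = snr_db(17) - snr_db(16)    # roughly 6 dB
```

Under these assumptions, each additional bit of resolution improves the SNR by about 6 dB, since doubling the voltage ratio adds 20 × log(2) dB.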


3.3 Processing Units 


Let us now discuss the next hardware element in the loop of Fig. 3.2, processing 
units. For information processing in embedded systems, we will consider ASICs 
(application-specific integrated circuits) using hardwired multiplexed 
designs, reconfigurable logic, and programmable processors. We will consider 
ASICs first. 


3.3.1 Application-Specific Integrated Circuits (ASICs) 


For high-performance applications and for large markets, application-specific inte- 
grated circuits (ASICs) can be designed. In general, ASICs are very energy-efficient 
(see Sect. 3.7.3 on p. 193). However, the cost of designing and manufacturing 
such chips is quite high. The cost of the mask set (which is used for transferring 
geometrical patterns onto the chip) has grown.⁷ 

It is feasible to decrease this cost by using less advanced semiconductor 
fabrication technologies and by using multi-project wafers (MPW) containing 
several designs. But there is a lack of flexibility: correcting design errors typically 
requires a new mask set and a new fabrication run (unless the ASIC contains 
processors with writable memories). This approach also has to cope with potentially 
large design efforts requiring dedicated skills and expensive tools. Therefore, 
ASICs are appropriate only under special circumstances, like large market volumes, 
ultimate energy efficiency demands, special voltage or temperature ranges, mixed 
analog/digital signals, or security-driven designs. Hence, the design of ASICs is not 
covered in this book. 

⁷ In 2017, http://anysilicon.com/semiconductor-wafer-mask-costs/ mentioned an average cost of 
about $1.5M for a 28 nm technology. 


3.3.2 Processors 


The key advantage of processors is their flexibility. With processors, the behavior 
of embedded systems can be changed by changing the software running on those 
processors. Changes of the behavior may be required in order to correct design 
errors, to update the system to a new standard, or to add features. Because of 
this, processors have found widespread use in embedded systems. In particular, 
processors which are available commercially “off-the-shelf” (COTS) have become 
very popular. 

Embedded processors must be used in a resource-aware manner, i.e., we need 
to care about resources required for running applications on them. Furthermore, 
they do not need to be instruction set compatible with commonly used personal 
computers (PCs) or servers. Therefore, their architectures may be different from 
those processors. Efficiency has different aspects (see p. 13), some of which are 
discussed next. 


Energy Efficiency 


The energy E for an application is related to the power P as a function of time, 
since 


\[ E = \int P(t)\, dt \tag{3.13} \]


Let us assume that we start with some design having a power consumption of P_0(t), 
leading to an energy consumption of 

\[ E_0 = \int_0^{t_0} P_0(t)\, dt \]

after t_0 units of execution time. Suppose that a modified design finishing 
computations already at time t_1 comes with a power consumption of P_1(t) and an energy 
consumption of 

\[ E_1 = \int_0^{t_1} P_1(t)\, dt \]

If P_1(t) is not too much larger than P_0(t), then a reduction of the execution time also 
reduces the energy consumption. However, this is not always true. The situation is 
also shown in Fig. 3.16: E_1 may be smaller than E_0, but E_1 can 
also be larger than E_0. So, if the energy consumption is to be minimized, it should 
be used as a cost function. Just minimizing the execution time can be misleading. 
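
A numeric sketch may make the two cases concrete. The power profiles below are hypothetical constants chosen only for illustration, and the helper `energy` approximates the integral of Eq. (3.13) with the midpoint rule.

```python
def energy(power, t_end, steps=10_000):
    """Approximate E = integral of power(t) dt from 0 to t_end (midpoint rule)."""
    dt = t_end / steps
    return sum(power((i + 0.5) * dt) for i in range(steps)) * dt


e0 = energy(lambda t: 1.0, 10.0)        # original design: 1 W for 10 s
e1_good = energy(lambda t: 1.5, 5.0)    # twice as fast at 1.5 W
e1_bad = energy(lambda t: 3.0, 5.0)     # twice as fast at 3 W
```

With these assumed numbers, the original design needs 10 J, the first modified design 7.5 J, and the second 15 J, although both modified designs halve the execution time.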

Minimization of power and energy consumption are both important. Power 
consumption has an effect on the size of the power supply, the design of the 
voltage regulators, the dimensioning of the interconnect, and short-term cooling. 
Minimizing the energy consumption is required especially for mobile applications, 
since battery technology is only slowly improving and since the cost of energy may 
be quite high. Also, a reduced energy consumption decreases cooling requirements 
and improves the reliability (since the lifetime of electronic circuits decreases for 
high temperatures). 

Next, we would like to demonstrate that for CMOS technology, it is preferable 
to replace high-speed sequential computations by reduced speed parallel 
computations. This is shown by first considering the power consumption of CMOS 
devices. The dynamic power consumption is the power consumption caused by 
switching (in contrast to the static power consumption which exists even if no 
switching takes place). The average dynamic power consumption P_dyn of CMOS 
circuits is given by Chandrakasan et al. [90] 

\[ P_{dyn} = \alpha \, C_L \, V_{dd}^2 \, f \tag{3.14} \]

where α is the switching activity, C_L is the load capacitance, V_dd is the supply 
voltage, and f is the clock frequency. This means that the power consumption of 
CMOS processors increases (at least)⁸ quadratically with the supply voltage V_dd. 


⁸ In practice, the increase may actually come with a larger exponent. 

The delay of CMOS circuits can be approximated as [90] 

\[ \tau = k \, C_L \, \frac{V_{dd}}{(V_{dd} - V_t)^2} \tag{3.15} \]


where k is a constant and V_t is the threshold voltage. V_t has an impact on the 
transistor input voltage required to switch the transistor on. For example, for a 
maximum supply voltage of V_dd,max = 3.3 V, V_t may be in the order of 0.8 V. 
Consequently, the maximum clock frequency is a function of the supply voltage. 
However, decreasing the supply voltage reduces the power quadratically, while the 
run-time of algorithms is only linearly increased (ignoring the effects of the memory 
system). 

We can use this to reduce the amount of energy required for a certain amount 
of computations. Let us assume that we are initially performing computations 
sequentially at voltage V_dd, constant power P, clock frequency f, run-time of t, 
and energy consumption E = P × t. 

Now let us assume that we are moving toward executing β operations in parallel. 
Due to parallel execution, we can extend the time for each operation by a factor of 
β. In turn, we can also reduce frequency f by a factor of β and use a new frequency 

\[ f' = \frac{f}{\beta} \tag{3.16} \]

This allows us to also reduce the voltage to a new voltage 

\[ V'_{dd} = \frac{V_{dd}}{\beta} \tag{3.17} \]

This reduces the power P^0 per operation quadratically: 

\[ P^0 = \frac{P}{\beta^2} \tag{3.18} \]

Due to executing β operations in parallel, the overall power P' can be computed as 

\[ P' = \beta \cdot P^0 = \frac{P}{\beta} \tag{3.19} \]

The time t' to execute operations in parallel is the same as the time to compute them 
sequentially (t' = t). Hence, the energy to execute the operations in parallel is 

\[ E' = P' \cdot t' = \frac{E}{\beta} \tag{3.20} \]

We conclude that it is more energy-efficient to execute β operations in parallel 
instead of computing them sequentially. However, our derivation contains a number 
of approximations. On the one hand, power may even depend cubically 
on the voltage, and we have ignored the fact that memory speed is frequently a 
limiting constraint. Faster processor clock speeds might just lead to more waiting for 
memory accesses (but there may also be conflicts for memory access from multiple 
cores). The energy would decrease quadratically if we were able to keep the 
power consumption independent of the level of parallelism. On the other hand, we 
need to be able to find β operations which can be executed in parallel. Overall, 
we keep in mind that parallel execution is a means for deriving energy-efficient 
implementations, regardless of which hardware technology we are using. 
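
The chain of Eqs. (3.18) to (3.20) can be replayed with concrete numbers. The sketch below simply applies these equations under the same approximations as the derivation; the function name and the sample values (8 W, 2 s, β = 4) are hypothetical.

```python
def parallel_scaling(p, t, beta):
    """Apply Eqs. (3.18)-(3.20): returns (P0 per operation, overall P', E')."""
    p_op = p / beta ** 2      # Eq. (3.18): power per operation
    p_total = beta * p_op     # Eq. (3.19): beta operations run in parallel
    e_total = p_total * t     # Eq. (3.20): same total time, t' = t
    return p_op, p_total, e_total


p_op, p_total, e_total = parallel_scaling(8.0, 2.0, 4)
```

For a sequential design with P = 8 W and t = 2 s (E = 16 J), running β = 4 operations in parallel gives 0.5 W per operation, 2 W overall, and E' = 4 J, which is E/β.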

Architectures must be optimized for their energy efficiency, and we must make 
sure that we are not losing efficiency in the software generation process. For 
example, compilers generating 50% overhead in terms of the number of cycles will 
take us further away from the efficiency of ASICs, possibly by even more than 
50%, if the supply voltage and the clock frequency must be increased in order to 
meet timing deadlines. 

There is a large number of techniques available that can make processors 
energy-efficient, and energy efficiency should be considered at various levels of 
abstraction, from the design of the instruction set down to the design of the chip 
manufacturing process [77]. Gated clocking and power gating are examples of such 
techniques. With gated clocking, parts of the processor are disconnected from the 
clock during idle periods. In a similar way, the power can be disconnected for some 
components. For example, direct memory access (DMA) hardware or bus bridges 
can be disconnected if they are not needed. Also, there are attempts to get rid of 
the clock for major parts of the processor altogether. There are two contrasting 
approaches: globally synchronous locally asynchronous (GSLA) processors [436] 
and globally asynchronous locally synchronous (GALS) processors [262]. Further 
information about low-power design techniques is available in a book by E. Macii 
[359] and in the PATMOS proceedings (see http://www.patmos-conf.org/). 

At least three techniques can be applied at a rather high level of abstraction: 


• Parallel execution: According to Eq. (3.20), parallel execution is an effective 
  means of improving the overall energy efficiency. 

• Dynamic power management (DPM): With this approach, processors have 
  several power-saving states in addition to the standard operating state. Each 
  power-saving state has a different power consumption and a different time for 
  transitions into the operating state. Figure 3.17 shows the three states for the 
  StrongARM SA-1100 processor. 

  The processor is fully operational in the run state. In the idle state, it is just 
  monitoring the interrupt inputs. In the sleep state, on-chip activity is shut down, 
  the processor is reset, and the chip’s power supply is shut off [593]. A separate 
  I/O power supply provides power to power manager hardware. The processor can 
  be restarted by the power manager hardware by a preprogrammed wake-up event. 
  Note the large difference in the power consumption between the sleep state and 
  the other states, and note also the large delay for transitions from the sleep to the 
  run state. 

• Dynamic voltage and frequency scaling (DVFS): Equation (3.14) can be 
  exploited in a technique called dynamic voltage and frequency scaling 
  (DVFS). For example, the Crusoe™ processor by Transmeta [295] provided 
  32 voltage levels between 1.1 and 1.6 V, and the clock could be varied 
  between 200 MHz and 700 MHz in increments of 33 MHz. Transitions from 
  one voltage/frequency pair to the next took about 20 ms. Design issues for 
  DVFS-capable processors are described in a paper by Burd and Brodersen [76]. 
  In 2004, Intel SpeedStep® Technology provided six different voltage/frequency 
  combinations for Pentium® M processors [246]. More recent processors include 
  more comprehensive mechanisms for power management. 
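
To get a feeling for what DVFS can save, Eq. (3.14) can be evaluated for the end points of the ranges quoted above. This is a rough sketch: it assumes that the lowest voltage is paired with the lowest frequency and the highest voltage with the highest frequency, and that α and C_L are the same in both operating points. The function names are hypothetical.

```python
def relative_dynamic_power(v, f, v_ref, f_ref):
    """P_dyn ratio from Eq. (3.14), assuming equal alpha and C_L."""
    return (v / v_ref) ** 2 * (f / f_ref)


def relative_energy_per_cycle(v, v_ref):
    """Energy per clock cycle is P_dyn / f, so only the voltage ratio remains."""
    return (v / v_ref) ** 2


power_ratio = relative_dynamic_power(1.1, 200e6, 1.6, 700e6)
energy_ratio = relative_energy_per_cycle(1.1, 1.6)
```

Under these assumptions, the slowest operating point draws roughly 14% of the dynamic power of the fastest one, and each clock cycle costs roughly 47% of the energy. Static power is ignored in this estimate.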


Code Size Efficiency 


Minimizing the code size is very important for embedded systems, since large 
hard disk drives (HDDs) or solid-state disks (SSDs) are typically not available and 
since the capacity of memory is typically also very limited.⁹ This is even more 
pronounced for systems on a chip (SoCs). For SoCs, the memory and processors are 
implemented on the same chip. In this particular case, memory is called embedded 
memory. Embedded memory may be more expensive to fabricate than separate 
memory chips, since the fabrication processes for memories and processors must be 
compatible. Nevertheless, a large percentage of the total chip area may be consumed 
by the memory. There are several techniques for improving the code size efficiency: 


• CISC machines: Standard RISC processors have been designed for speed, 
  not for code size efficiency. Earlier complex instruction set processors (CISC 
  machines) were actually designed for code size efficiency, since they had to 
  be connected to slow memories. Caches were not frequently used. Therefore, 
  “old-fashioned” CISC processors are finding applications in embedded systems. 
  ColdFire processors [170], which are based on the Motorola 68000 family of 
  CISC processors, are an example. 

• Compression techniques: In order to reduce the amount of silicon needed for 
  storing instructions as well as in order to reduce the energy needed for fetching 
  these instructions, instructions are stored in memory in compressed form. 
  This reduces both the area and the energy necessary for fetching instructions. 
  Due to the reduced bandwidth requirements, fetching can also be faster. A 
  (hopefully small and fast) decoder is placed between the processor and the 
  (instruction) memory in order to generate the original instructions on the fly (see 
  Fig. 3.18 (right)).¹⁰ Instead of using a potentially large memory of uncompressed 
  instructions, we are storing the instructions in a compressed format. 

  Fig. 3.18 Schemes for instruction fetch: left, uncompressed; right, compressed 

  ⁹ The availability of large flash memories and 3D integration make memory size 
  constraints less tight. 

The goals of compression can be summarized as follows: 


- We would like to save ROM and RAM areas, since these may be more 
  expensive than the processors themselves. 

- We would like to use some encoding technique for instructions and possibly 
  also for data with the following properties: 

  • There should be little or no run-time penalty for these techniques. 
  • Decoding should work from a limited context (it is, e.g., impossible to read 
    the entire program to find the destination of a branch instruction). 
  • Word sizes of the memory, of instructions, and of addresses must be taken 
    into account. 
  • Branch instructions branching to arbitrary addresses must be supported. 
  • Fast encoding is only required if writable data is encoded. Otherwise, fast 
    decoding is sufficient. 


There are several variations of this scheme: 


- For some processors, there is a second instruction set. This second instruction 
  set has a narrower instruction format. An example of this is the ARM® 
  processor family. The original ARM instruction set is a 32 bit instruction set. 
  Most ARM processors also provide a second instruction set, with 16 bit wide 
  instructions, called THUMB instructions. THUMB instructions are shorter, 
  since they do not support predication,¹¹ use shorter and fewer register fields, 
  and use shorter immediate fields (see Fig. 3.19). 

  THUMB instructions are dynamically converted into ARM instructions 
  while programs are decoded. THUMB instructions can use only half the 
  registers in arithmetic instructions. Therefore, register fields of THUMB 
  instructions are concatenated with a '0' bit.¹² In the THUMB instruction set, 
  source and destination registers are identical, and the length of constants that 
  can be used is reduced by 4 bits. During decoding, pipelining is used to keep 
  the run-time penalty low. 

  ¹⁰ We continue denoting multiplexers, arithmetic units, and memories by shape symbols, due to 
  their widespread use in technical documentation. For memories, we adopt shape symbols including 
  an explicit address decoder (included in the shape symbols for the ROMs on the right). These 
  decoders identify the address input. 

Similar techniques also exist for other processors. The disadvantage of this 
approach is that the tools (compilers, assemblers, debuggers, etc.) must be 
extended to support a second instruction set. Therefore, this approach can be 
quite expensive in terms of software development cost. 

- A second approach is the use of dictionaries. With this approach, each 
  instruction pattern is stored only once. For each value of the program counter, 
  a look-up table provides a pointer to the corresponding instruction in the 
  instruction table, the dictionary (see Fig. 3.20). 

  This approach relies on using only very few different instruction patterns. 
  Therefore, only few entries are required for the instruction table. Hence, the 
  bit width of the pointers can be quite small. Many variations of this scheme 
  exist. Some are called two-level control store [118], nanoprogramming [514], 
  or procedure ex-lining [551]. 

  ¹¹ Instructions using predicated execution have an effect only if a certain condition encoded in 
  the instruction evaluates to true. This condition typically involves values stored in condition code 
  registers, resulting from previous instructions. For example, instructions might have an effect 
  only if a previous <=-expression was true. Predication can be used to implement if statements 
  efficiently: the condition is stored in one of the condition registers, and if-statement bodies are 
  implemented as predicated instructions which depend on this condition. For ARM processors, the 
  condition is encoded in the first 4 bits of the instruction format. As a special case, an “always” 
  condition can be encoded, like in Fig. 3.19. The more recently introduced 64 bit instruction set 
  places less emphasis on predicated execution. 

  ¹² Using VHDL notation (see p. 98), concatenation is denoted by an & sign, and constants are 
  enclosed in quotes in Fig. 3.19. 

Fig. 3.21 Naming conventions for signals 



Beszedes [52] and Latendresse [324] provide overviews of a large number 
of known compression techniques. In addition, Bonny et al. [58] published a 
Huffman-based technique. 
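
The dictionary approach can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the instruction words in the example are arbitrary 32 bit patterns rather than meaningful encodings of any real instruction set.

```python
def compress(instructions):
    """Return (dictionary, pointers): each distinct pattern is stored only once."""
    dictionary, index, pointers = [], {}, []
    for word in instructions:
        if word not in index:
            index[word] = len(dictionary)
            dictionary.append(word)
        pointers.append(index[word])
    return dictionary, pointers


def fetch(pc, dictionary, pointers):
    """Decode on the fly: the program counter selects a pointer into the dictionary."""
    return dictionary[pointers[pc]]


def pointer_width(dictionary):
    """Number of bits needed for one pointer into the dictionary."""
    return max(1, (len(dictionary) - 1).bit_length())


program = [0xE3A00001, 0xE2800001, 0xE3A00001, 0xE2800001, 0xE1A0F00E]
dictionary, pointers = compress(program)
```

In this example, five 32 bit words are replaced by three dictionary entries plus five 2 bit pointers. The saving grows with the number of repeated patterns and disappears if almost every instruction pattern is unique.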


Execution Time Efficiency Using Digital Signal Processing as an Example 


In order to meet time constraints without having to use high clock frequencies, 
architectures can be customized to certain application domains, such as digital signal 
processing (DSP). Let us have a closer look at DSP now! In digital signal processing, 
digital filtering is a very frequent operation. Let us assume that we are extending the 
pipeline of Fig. 3.8 on p. 135. We add a processing component, supposed to perform 
filtering. Names of signals are shown in Fig. 3.21. 

Equation (3.21) describes a digital filter generating an output signal x(t) from an 
input signal w(t). Both signals are defined over the (usually unbounded) domain 
{t_s} of sampling instances. We write x_s instead of x(t_s) and w_{s−n+k+1} instead of 
w(t_{s−n+k+1}).¹³ 

\[ x_s = \sum_{k=0}^{n-1} w_{s-n+k+1} \cdot a_k \tag{3.21} \]

¹³ In our notation, a_0 is the weight of the oldest input value. If we defined a_0 as the weight 
of the youngest value of w, the first term would take the more commonly used form w_{s−k}. Our 
notation simplifies understanding the program code shown below. 

Fig. 3.22 Internal architecture of the ADSP 2100 processor family (simplified) 



Output element x_s corresponds to a weighted average over the last n signal elements 
of w and can be computed iteratively, adding one product at a time. Processors for 
DSP are designed such that each iteration can be encoded as a single instruction. 


Example 3.5 This is feasible with DSP processors from the ADSP 2100 family, 
whose architecture is shown in Fig. 3.22. 

The processor has two memories, called DM and PM. A special address generating 
unit (AGU) can be used to provide the pointers for accessing these memories in 
index registers I0-I7. There are separate units for additions and multiplications, 
each with their own argument registers AX0, AY0, AF, MX0, MY0, and MF. The 
multiplier is connected to a second adder in order to compute the combination 
of a multiplication and an addition (so-called MAC operation) quickly. For this 
processor, one iteration is performed in a single cycle. For this purpose, the two 
memories are allocated to hold the two arrays w and a. 

Pointers to array elements can be kept in index registers. At each iteration, the 
value contained in one of the modify registers M0-M7 is added to the used index 
register. This is typically encoded as a side effect of accessing an array element. 

Partial sums are stored in MR. 

We would need unlimited memory space if, at each time instance t_s, we were 
storing a new value in the next unused memory element. However, a bounded 
memory is sufficient, since we only need to access the most recent n values. This is 
feasible with a ring buffer, implemented with modulo operations for index values. 
The size of this buffer can be stored in length registers L0 to L7. 

Obviously, the registers mentioned serve different purposes. Therefore, they are 
called heterogeneous registers. Heterogeneous registers are frequently found in 
DSP processors. 

In order to avoid extra cycles for testing for the end of the loop, zero- 
overhead loop instructions are frequently provided in DSP processors. With such 
instructions, a single or a small number of instructions can be executed a fixed 
number of times. 

Next, we are able to present the pipelined computation of Eq. (3.21), using 
processors from the ADSP 2100 family (adapted from [14]): 


/* outer loop over sampling times ts */ { 
  L0 = n; L4 = n;                       /* length of ring buffer(s) */ 
  M1 = 1; M5 = 1;                       /* increment for index registers */ 
  I0 = address of oldest value in w; I4 = start of weight table a; 
  MX0 = DM[I0]; MY0 = PM[I4];           /* loading oldest w[] & a0 */ 
  MR = 0; I0 = I0 + M1; I4 = I4 + M5;   /* ring buffer aware add */ 
  for (k=0; k<(n-1); k++) {             /* n-1 iterations */ 
    MR = MR + MX0 * MY0; MX0 = DM[I0]; MY0 = PM[I4];  /* MAC operation */ 
    I0 = I0 + M1; I4 = I4 + M5;         /* ring buffer aware add */ 
  } 
  MR = MR + MX0 * MY0; x[s] = MR;       /* MAC for youngest elem. */ 
} 


The outer loop corresponds to the progressing time. For each iteration of the outer 
loop, we initialize some registers. For the inner loop, a single instruction encodes 
the inner loop body, comprising the following operations: 


• reading of two arguments from argument registers MX0 and MY0, multiplying 
  them, and adding the product to register MR storing partial sums (so-called MAC 
  operation), 

• fetching the next elements of arrays w and a from memories DM and PM and 
  storing them in argument registers MX0 and MY0, 

• updating pointers to the next arguments, stored in address registers I0 and I4, by 
  adding values stored in M1 and M5 and considering lengths in L0 and L4, 

• testing for the end of the loop. 


For given computational requirements, this (limited) form of parallelism leads to 
relatively low clock frequencies. Processors not optimized for DSP would probably 
need several instructions per iteration and would therefore require a higher clock 
frequency, if available. ∇ 
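
For readers without access to such a processor, the computation of Eq. (3.21) with a ring buffer can be mirrored in ordinary software. The sketch below is a functional model only, not a translation of the instruction sequence above: the function names are hypothetical, inputs before the first sample are assumed to be zero, and the variables `mr` and `i0` merely imitate the roles of registers MR and I0.

```python
def fir_step(ring, oldest, a):
    """Compute x_s = sum over k of w[s-n+k+1] * a[k] from a ring buffer."""
    n = len(a)
    mr = 0                        # partial sums, in the role of register MR
    i0 = oldest                   # index into w, in the role of register I0
    for k in range(n):            # a[0] is the weight of the oldest value
        mr += ring[i0] * a[k]     # MAC operation
        i0 = (i0 + 1) % n         # ring buffer aware add (modulo the length)
    return mr


def fir_filter(w, a):
    """Filter the sequence w, assuming earlier inputs were zero."""
    n = len(a)
    ring = [0] * n
    pos = 0                       # slot of the oldest value, overwritten next
    out = []
    for sample in w:
        ring[pos] = sample        # the newest value replaces the oldest one
        pos = (pos + 1) % n       # the oldest value is now found at pos
        out.append(fir_step(ring, pos, a))
    return out


result = fir_filter([1, 2, 3], [1, 10])
```

With weights a = [1, 10], each output is the previous input plus ten times the current input, so the three outputs are 10, 21, and 32. A general-purpose processor executes several instructions per loop iteration here, which is exactly the overhead that the single-instruction loop body of the DSP avoids.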


In addition to allowing single instruction realizations of loop bodies for filtering, 
DSP processors provide a number of other application domain-oriented features: 


• Saturating arithmetic changes overflow and underflow handling. In standard 
  binary arithmetic, wrap-around is used for the values returned after an overflow 
  or underflow. Table 3.1 shows an example in which two unsigned 4 bit numbers 
  are added. A carry is generated which cannot be returned in any of the standard 
  registers. The result register will contain a pattern of all zeros. No result could be 
  further away from the true result than this one. 

In saturating arithmetic, the result is as close as possible to the true result. 
For saturating arithmetic, the largest value is returned in the case of an overflow, 
and the smallest value is returned in the case of an underflow. This approach 
makes sense especially for video and audio applications: the user will hardly 
recognize the difference between the true result value and the largest value that 
can be represented. Also, it would be useless to raise exceptions if overflows 
occur, since it is difficult to handle exceptions in real time. Returning the right 
value is feasible only if we know whether we are dealing with signed or unsigned 
numbers. 

• Fixed-point arithmetic: Sometimes, properties of floating-point computations
[186] are not welcome, and floating-point hardware increases the cost and power
consumption of processors. Hence, it has been estimated that 80% of the DSP
processors do not include floating-point hardware [1]. However, in addition to
supporting integers, many processors support fixed-point numbers. Fixed-point
data types can be specified by a 3-tuple (wl, iwl, sign), where wl is the total
word length, iwl is the integer word length (the number of bits left of the binary
point), and sign ∈ {s, u} denotes whether numbers are signed (s) or unsigned (u).
See also Fig. 3.23. Furthermore, there may be different rounding modes (e.g.,
truncation) and overflow modes (e.g., saturating and wrap-around arithmetic).

For fixed-point numbers, the position of the binary point is maintained after 
multiplications (some low-order bits are truncated or rounded). For fixed-point 
processors, this operation is supported by hardware. 

• Real-time capability: Some of the features of modern processors used in PCs
are designed to improve the average execution time of programs. In many cases, 
it is difficult if not impossible to formally verify that they improve the worst case 
execution time. In such cases, it may be better not to implement these features. 
For example, it is difficult (though not impossible [4]) to guarantee a certain 
speed-up resulting from the use of caches. Therefore, caches are sometimes not 


Table 3.1 Wrap-around vs. saturating arithmetic for unsigned integers

                                    0 1 1 1
+                                   1 0 0 1
Standard wrap-around arithmetic   1 0 0 0 0
Saturating arithmetic               1 1 1 1

Fig. 3.23 Parameters of a fixed-point number (figure labels: sign, binary point)

Fig. 3.24 Using 64 bit registers for 16 bit data types


used for embedded applications. Also, virtual addressing and demand paging¹⁴
are frequently not found in embedded systems. Techniques for computing worst
case execution times will be presented in subsection 5.2.2.
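The difference between wrap-around and saturating arithmetic described in the list above can be reproduced with a few lines of Python. This is a minimal sketch; the function names and the `bits` parameter are assumptions made for illustration and do not correspond to any particular instruction set.

```python
def add_wrap(a, b, bits=4):
    """Unsigned addition with wrap-around: the carry out of the word is lost."""
    return (a + b) & ((1 << bits) - 1)


def add_sat_unsigned(a, b, bits=4):
    """Unsigned saturating addition: overflow returns the largest value."""
    return min(a + b, (1 << bits) - 1)


def sub_sat_unsigned(a, b):
    """Unsigned saturating subtraction: underflow returns the smallest value."""
    return max(a - b, 0)


def add_sat_signed(a, b, bits=4):
    """Signed (two's complement) saturating addition."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, a + b))


print(add_wrap(0b0111, 0b1001), add_sat_unsigned(0b0111, 0b1001))
```

The signed and unsigned variants clamp to different limits for the same word length, which is why the kind of number must be known in order to return the right value.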


Due to the importance of signal processing, instructions for DSP have been added 
to many instruction sets. 
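The fixed-point format (wl, iwl, sign) introduced above can be illustrated in Python as well. The sketch below stores numbers as integers scaled by 2^(wl − iwl), truncates low-order bits after a multiplication so that the binary point stays in place, and saturates on overflow. All function names are hypothetical, and the sketch assumes that for signed numbers the sign bit is counted as part of iwl.

```python
import math


def saturate(raw, wl, signed=True):
    """Clamp a raw integer to the range representable in wl bits."""
    lo, hi = (-(1 << (wl - 1)), (1 << (wl - 1)) - 1) if signed else (0, (1 << wl) - 1)
    return max(lo, min(hi, raw))


def to_fixed(x, wl, iwl, signed=True):
    """Convert a real number to its raw fixed-point integer (truncation)."""
    return saturate(math.floor(x * (1 << (wl - iwl))), wl, signed)


def to_float(raw, wl, iwl):
    """Interpret a raw fixed-point integer as a real number."""
    return raw / (1 << (wl - iwl))


def fixed_mul(a, b, wl, iwl, signed=True):
    """Multiply two raw values and keep the position of the binary point."""
    product = a * b                        # has 2 * (wl - iwl) fractional bits
    return saturate(product >> (wl - iwl), wl, signed)   # drop low-order bits


x = to_fixed(1.5, 8, 4)
y = to_fixed(2.25, 8, 4)
print(to_float(fixed_mul(x, y, 8, 4), 8, 4))
```

On a fixed-point processor, the shift and the saturation are performed by hardware rather than by separate instructions.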


Multimedia and Short Vector Instruction Sets 


Registers and arithmetic units of many modern architectures are at least 64 bit wide. 
Two 32 bit data types, four 16 bit data types, or eight 8 bit data types (“bytes”) can 
be packed into a single 64 bit register (see Fig. 3.24). 

Arithmetic units can be designed such that they suppress carry bits at 32 bit, 16 
bit, or byte boundaries. Multimedia instruction sets exploit this fact by supporting 
operations on packed data types. Such instructions are sometimes called single- 
instruction, multiple-data (SIMD) instructions, since a single instruction encodes 
operations on several data elements. With bytes packed into 64 bit registers, speed- 
ups of up to about eight over non-packed data types are possible. Data types are 
typically stored in packed form in memory. Unpacking and packing are avoided 
if arithmetic operations on packed data types are used. Furthermore, multimedia 
instructions can usually be combined with saturating arithmetic and therefore pro- 
vide a more efficient form of overflow handling than standard instructions. Hence, 
the overall speed-up achieved with multimedia instructions can be significantly 
larger than the factor of eight enabled by operations on packed 64 bit data types. Due 
to the advantages of operations on packed data types, new instructions have been 
added to several processors. For example, so-called streaming SIMD extensions 
(SSE) have been added to Intel’s family of Pentium®-compatible processors [247]. 
Newer instructions of this kind are also called short vector instructions; Intel®
has introduced them as Advanced Vector Extensions (AVX) [248].
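The suppression of carry bits at lane boundaries can be emulated in software. The following Python sketch adds four 16 bit values packed into one 64 bit word with wrap-around in each lane. It uses a well-known masking trick rather than any real multimedia instruction; the names `pack`, `unpack`, and `packed_add16` are assumptions made for this illustration.

```python
HIGH = 0x8000_8000_8000_8000   # most significant bit of each 16 bit lane
LOW = 0x7FFF_7FFF_7FFF_7FFF    # remaining 15 bits of each lane


def pack(values):
    """Pack four 16 bit values into one 64 bit word (lane 0 in the low bits)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFFFF) << (16 * i)
    return word


def unpack(word):
    """Split a 64 bit word into its four 16 bit lanes."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]


def packed_add16(x, y):
    """Add four lanes at once; carries never cross a 16 bit boundary."""
    low_sum = (x & LOW) + (y & LOW)        # carry stops at bit 15 of each lane
    return low_sum ^ ((x ^ y) & HIGH)      # add the top bits without carry out


a = pack([1, 0xFFFF, 0x8000, 1234])
b = pack([2, 1, 0x8000, 4321])
print(unpack(packed_add16(a, b)))
```

Hardware supporting packed data types achieves the same effect directly in the adder, so that a single instruction processes all lanes.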


¹⁴ See Appendix C on p. 401 for an introduction to paging.
Very Long Instruction Word (VLIW) Processors


Computational demands for embedded systems are increasing, especially when 
multimedia applications, advanced coding techniques, or cryptography are involved. 
Performance improvement techniques used in high-performance microprocessors 
are not appropriate for embedded systems: driven by the need for instruction set 
compatibility, processors found, for example, in PCs spend a huge amount of 
resources and energy on automatically finding parallelism in application programs. 
Still, their performance is frequently not sufficient. For embedded systems, we can 
exploit the fact that instruction set compatibility with PCs is not required. Therefore, 
we can use instructions which explicitly identify operations to be performed in 
parallel. This is possible with explicit parallelism instruction set computers 
(EPICs). With EPICs, detection of parallelism is moved from the processor to the 
compiler. This avoids spending silicon and energy on the detection of parallelism 
at run-time. As a special case, we consider very long instruction word (VLIW) 
processors. For VLIW processors, several operations or instructions are encoded 
in a long instruction word (sometimes called instruction packet) and are assumed 
to be executed in parallel. Each operation/instruction is encoded in a separate field 
of the instruction packet. Each field controls certain hardware units. Four such fields 
are used in Fig. 3.25, each one controlling one of the hardware units. 

For VLIW architectures, the compiler has to generate instruction packets. This 
requires that the compiler is aware of the available hardware units and schedules 
their use. 

Instruction fields must be present, regardless of whether or not the corresponding 
functional unit is actually used in a certain instruction cycle. As a result, the code 
density of VLIW architectures may be low if insufficient parallelism is detected to 
keep all functional units busy. The problem can be avoided if more flexibility is 
added. 

For example, the Texas Instruments TMS 320C6xx family of processors imple- 
ments a variable instruction packet size of up to 256 bits. In each instruction field, 
1 bit is reserved to indicate whether or not the operation encoded in the next field 
is still assumed to be executed in parallel. No instruction bits are wasted for unused 
functional units. Due to their variable-length instruction packets, TMS 320C6xx
processors do not quite correspond to the classical model of VLIW processors. Due
to their explicit description of parallelism, they are EPIC processors, though.
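The idea of marking parallelism with one bit per instruction field can be sketched as follows. The Python function below groups a sequence of fields into sets of operations that are assumed to execute in parallel. The representation of fields as (operation, bit) pairs, the polarity of the bit (1 meaning "the next operation is still parallel"), and the operation names are assumptions made for illustration; they are not taken from the TMS 320C6xx documentation.

```python
def group_parallel_operations(fields):
    """Group (operation, parallel_bit) pairs into parallel execution groups."""
    groups, current = [], []
    for operation, parallel_bit in fields:
        current.append(operation)
        if not parallel_bit:          # the next field starts a new group
            groups.append(current)
            current = []
    if current:                       # tolerate a trailing open group
        groups.append(current)
    return groups


fields = [("ADD", 1), ("MPY", 1), ("LDW", 0), ("SUB", 0), ("B", 1), ("NOP", 0)]
print(group_parallel_operations(fields))
```

In this encoding, a cycle in which only one functional unit is busy costs a single field instead of a full-width instruction word.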
Fig. 3.26 Branch instruction and delay slots


Implementing register files for VLIW and EPIC processors is far from trivial. 
Due to the large number of operations that can be performed in parallel, a large 
number of register accesses has to be provided in parallel. Therefore, a large number 
of ports is required. However, the delay, size, and energy consumption of register 
files increase with their number of ports. Hence, register files with very large 
numbers of ports are inefficient. As a consequence, many VLIW/EPIC architectures 
use partitioned register files. Functional units are then only connected to a subset of 
the registers. 


VLIW Pipelines 


A potential problem of VLIW and EPIC architectures is their possibly large delay 
penalty: this delay penalty might originate from branch instructions found in some 
instruction packets. Instruction packets normally must pass through pipelines. Each 
stage of these pipelines implements only part of the operations to be performed by 
the instructions executed. Branch instructions cannot be detected in the first stage 
of the pipeline. When the execution of the branch instruction is finally completed, 
additional instructions have already entered the pipeline (see Fig. 3.26). 
There are essentially two ways to deal with these additional instructions: 


1. They are executed as if no branch had been present. This case is called delayed 
branch. Instruction packet slots that are still executed after a branch are called 
branch delay slots. These branch delay slots can be filled with instructions 
which would be executed before the branch if there were no delay slots. However, 
it is normally difficult to fill all delay slots with useful instructions, and some 
must be filled with no-operation instructions (NOPs). The term branch delay 
penalty denotes the loss of performance resulting from these NOPs. 

2. The pipeline is stalled until instructions from the branch target address have been 
fetched. There are no branch delay slots in this case. In this organization, the 
branch delay penalty is caused by the stall. 
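The first of the two alternatives, the delayed branch, can be made concrete with a toy interpreter. The Python sketch below returns the sequence of executed instructions for a given number of delay slots. The program encoding as (kind, argument) pairs and the function name are assumptions made for illustration; branches inside delay slots are ignored for simplicity.

```python
def run_delayed_branch(program, delay_slots):
    """Return the trace of executed instructions of a toy delayed-branch machine."""
    trace, pc = [], 0
    countdown, target = None, None         # state of a pending branch, if any
    while pc < len(program):
        kind, arg = program[pc]
        trace.append(kind if kind == "br" else arg)
        pc += 1
        if countdown is not None:
            countdown -= 1                 # this instruction filled a delay slot
        elif kind == "br":
            countdown, target = delay_slots, arg
        if countdown == 0:                 # all delay slots done: branch takes effect
            pc, countdown = target, None
    return trace


program = [("op", "a"), ("br", 5), ("op", "b"), ("op", "nop"),
           ("op", "skipped"), ("op", "t")]
print(run_delayed_branch(program, delay_slots=2))
```

In this example, one delay slot holds a useful instruction (`b`) and the other one a NOP; the number of NOPs in the trace corresponds to the branch delay penalty.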
Fig. 3.27 Intel® Core™ Duo Processor


Branch delay penalties can be significant, and efficiency can be improved by 
avoiding branches if possible. In order to avoid branches originating from if 
statements, predicated instructions have been introduced (see p. 149). 

The Crusoe™ processor is an (ultimately commercially unsuccessful) example of an
EPIC processor designed for PCs [295]. Its instruction set includes 64 bit and 128 
bit VLIW instructions. Efforts for making EPIC instruction sets available in the PC 
sector resulted in Intel’s IA-64 instruction set [249] and its implementation in the 
Itanium® processor. Due to legacy problems, it has been used mainly in the server 
market. Many MPSoCs (see p. 162) are based on VLIW and EPIC processors. 


Multi-core Processors 


Processor features for single processors described above have helped to design high- 
performance processors in a resource-aware manner. However, it turned out that a 
further performance increase for single processors hits the power wall: a further
increase in clock speeds would result in excessive power consumption and in
overheated circuits. Further increase in the level of VLIW parallelism was not feasible
either. Due to advances in fabrication technology, it is now feasible to manufacture 
multiple processors on the same semiconductor die. Multiple processors integrated 
on the same chip are called multicores. This is in contrast to multiprocessor 
systems which have been used in computing centers for decades. The integration 
of multiple cores on the same die enables a much faster communication, compared 
to multiprocessor systems. Also, this approach facilitates the sharing of resources 
(such as caches) among the cores. As an example, Fig. 3.27 shows the
architecture of the Intel® Core™ Duo [540].

In this case, L1 caches are private, whereas L2 caches are shared. Implementing
efficient accesses to caches needs some consideration [540]. With such architectures,
cache coherence is becoming an issue also within one die. This means that we
have to know whether updates of data and possibly also instructions by one core
are seen by the others. Protocols for automatic cache coherence (like the MESI
protocol) have been known for many years in computer architecture [211]. Now, they
have to be implemented on the chip. Scalability is an issue: for how many cores can we
reasonably provide enough bandwidth in the communication architecture to always
keep caches coherent? Also, the system memory bandwidth may be insufficient for
a growing number of cores. Architectures other than the above Intel architecture
exist.

Fig. 3.28 ARM® Cortex®-A15 pipeline
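To give an impression of what a coherence protocol has to track, the following Python sketch models the MESI states (Modified, Exclusive, Shared, Invalid) of a single cache line held in the private caches of several cores. It is a strongly simplified illustration: bus transactions, write-backs, and timing are not modeled, and the class and method names are assumptions made for this sketch.

```python
class MesiLine:
    """MESI state of one cache line in the private cache of each core."""

    def __init__(self, cores):
        self.state = ["I"] * cores          # initially no cache holds the line

    def read(self, core):
        if self.state[core] != "I":
            return                          # read hit: M, E, and S are unchanged
        others = [c for c, s in enumerate(self.state) if c != core and s != "I"]
        for c in others:
            self.state[c] = "S"             # M (after write-back) and E become shared
        self.state[core] = "S" if others else "E"

    def write(self, core):
        for c in range(len(self.state)):
            if c != core:
                self.state[c] = "I"         # all other copies are invalidated
        self.state[core] = "M"


line = MesiLine(cores=2)
line.read(0)
line.read(1)
line.write(1)
print(line.state)
```

Every state change of another core's copy stands for a message on the on-chip communication architecture, which is why bandwidth limits the number of cores that can be kept coherent.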

In the architecture of Fig. 3.27, all processors are of the same type. Such
an architecture is called a homogeneous multi-core architecture. Advantages of 
homogeneous multi-core architectures include the fact that the design effort is 
limited (processors will be replicated) and that software can easily be migrated from 
one processor to another one. This is very useful in case one of the cores fails. 

In contrast to homogeneous multi-core architectures, there are also hetero- 
geneous multi-core architectures incorporating processors of different types. 
Processors which are best suited for certain applications can be selected. Typically, 
heterogeneous architectures achieve the best energy efficiency that is feasible. 

In order to find a good compromise between homogeneous and (totally) het- 
erogeneous architectures, architectures with a single instruction set but different 
internal architectures, so-called single-ISA heterogeneous multi-cores [316], have 
been proposed. The ARM® big.LITTLE architecture is a very prominent example 
of this. 

Figure 3.28 shows the pipeline architecture of the Cortex®-A15 processor
[165].

It is a complex pipeline, containing multiple pipeline stages for instruction fetch,
instruction decoding, instruction issue, execution, and write-back. Instructions have
to pass through at least 15 pipeline stages before their result is stored. Dynamic
scheduling of instructions allows executing instructions in a sequence different from
the one in which they are fetched from memory (so-called out-of-order execution).
Several instructions can be issued in one clock cycle (so-called multi-issue). The
architecture offers a high performance but requires much power.

Fig. 3.29 ARM® Cortex®-A7 pipeline (figure: fetch stages F1 to F3, decode, issue, followed by an integer pipe, a MAC pipe, and a floating-point pipe)

Fig. 3.30 DVFS curves for a large, representative workload on a single A7 or A15

In contrast, Fig. 3.29 shows the pipeline of the Cortex® -A7 architecture [165]. 

It is a simple pipeline. Instructions pass through 8 to 11 stages; they are 
always processed in the order in which they are fetched from memory (so- 
called in-order execution). There are few situations in which two instructions are 
issued concurrently. Hence, the architecture is power-efficient but has a limited 
performance. 

Figure 3.30 [165] demonstrates trade-offs between power consumption and 
performance. For each of the two architectures shown, there is flexibility for these 
two objectives, depending upon the supply voltage and the clock frequency. 

Obviously, the Cortex®-A15 is more appropriate for more demanding
high-performance applications, e.g., in video processing. The Cortex®-A7 is more
appropriate for “always-on applications” like low-volume message processing. It
would be a waste of energy if mobile phones contained only Cortex®-A15
cores.
Therefore, today’s multi-core chips typically are heterogeneous in that they
contain a mixture of high-performance and energy-efficient processors, as in 
Fig. 3.31. 


Graphics Processing Units (GPUs) 


In the last century, many computers used specialized graphics processing units 
(GPUs) in order to generate an appealing graphical representation of computer 
output. This hardwired solution suffered from being unable to support non-standard 
computer graphics algorithms. Therefore, these highly specialized GPUs have been 
replaced by programmable solutions. Current GPUs try to run a large number 
of computations concurrently in order to achieve the desired performance. The 
standard approach to concurrency is to run many fine-grained threads at the same 
time. The goal is to keep many processing units busy and to hide memory latencies 
by fast switching between threads. 


Example 3.6 Let us consider the multiplication of two large matrices on a GPU. 
Figure 3.32 [211] shows how the computations can be mapped to a GPU. 

The matrix is partitioned into so-called thread blocks. Each thread block can be 
allocated to one of the cores contained in a GPU. Each thread block, in turn, contains 
a number of threads, and each thread includes a number of instructions. In Fig. 3.32, 
the overall set of computations is called a grid. ∇
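The hierarchy of grid, thread blocks, and threads can be emulated sequentially. In the Python sketch below, the result matrix is partitioned into square blocks, and each "thread" computes one element of the result. This only illustrates the partitioning; a GPU would execute blocks and threads concurrently. The block size, the assignment of one result element per thread, and all names are assumptions made for this sketch.

```python
def matmul_grid(a, b, block=2):
    """Multiply matrices a and b, organized as a grid of thread blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]

    def thread(row, col):                   # one thread: one result element
        c[row][col] = sum(a[row][i] * b[i][col] for i in range(k))

    for block_row in range(0, n, block):    # the grid: all thread blocks
        for block_col in range(0, m, block):
            for row in range(block_row, min(block_row + block, n)):   # one block
                for col in range(block_col, min(block_col + block, m)):
                    thread(row, col)
    return c


print(matmul_grid([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Since the threads write to disjoint elements of the result, the blocks can be distributed over the cores of a GPU without further synchronization.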


Each core will try to achieve progress by executing threads. If some thread gets 
blocked, e.g., due to waiting for memory, the core will execute some other thread. 
The instructions contained in a thread can also be executed concurrently, e.g., 
by using multiple pipelines. The thread blocks can be executed concurrently on 
contemporary GPUs. Fast switching between the execution of threads and in this 
way hiding memory latencies is an essential feature for GPUs. 


Example 3.7 Figure 3.33 shows the architecture of the ARM® Mali™ -T880 GPU 
[23]. 

The architecture is defined as intellectual property (IP), comprising a synthesizable
model. In this model, the number of SC cores is configurable between
1 and 16. Each core includes several pipelines for the execution of arithmetic,
load/store, or texture-related instructions. In the thread issue hardware, as many
threads as possible are issued each clock phase. The GPU also contains additional
components like a memory management unit (see Appendix C), up to two caches,
and an AMBA® bus interface. Programming support includes an interface to the
OpenGL library [484] and to OpenCL (see https://www.khronos.org/opencl/). ∇

Fig. 3.32 Partitioning of matrix multiplication for execution on a GPU

Fig. 3.33 ARM® Mali™-T880 GPU

Fig. 3.34 ARM® big.LITTLE system on a chip (SoC)


In general, GPU computing achieves high performance in an energy-efficient
way (see also Sect. 3.7.3 on p. 193). 


Multiprocessor Systems on a Chip (MPSoCs) 


Going one step further, heterogeneous multi-core systems have also been merged 
with GPUs. 


Example 3.8 Figure 3.34 shows a contemporary heterogeneous multi-core system, 
also comprising a Mali GPU [22]. 

The architecture shown in Fig. 3.34 does not only contain processor cores.
Rather, it comprises a number of additional system components, such as memory
management units (see Appendix C) and interfaces for peripheral devices. Overall,
the idea behind this integration is to avoid extra chips for such functionality. As a
result, a whole system is integrated on one chip. Therefore, we call such an
architecture a system-on-a-chip (SoC) or even a multiprocessor system-on-a-chip
(MPSoC) architecture. ∇
Fig. 3.35 MPSoC 66AK from Texas Instruments® containing ARM® and C6xxx processors

Mapping techniques for such processors are important, since examples demonstrate
that a power efficiency close to that of ASICs can be achieved. For example,
for IMEC’s ADRES processor, an efficiency of 55 × 10⁹ operations per watt (about
50% of the power efficiency of ASICs) has been predicted [363, 481]. However, the
design effort for such architectures is larger than in the homogeneous case.


Example 3.9 There are MPSoCs comprising processors which we introduced
earlier: 66AK2x MPSoCs from Texas Instruments contain ARM® and C66xxx processors
[530] (see Fig. 3.35), demonstrating the relevance of the presented processors. ∇


The number and the diversity of components can be even larger. For example, 
there may be specialized processors for mobile communication or image processing. 


Example 3.10 Figure 3.36 contains a simplified floor-plan of the SH-MobileG1
chip [205]. The chip demonstrates that highly specialized processors are being used.
There are special processors for image processing (red), for GSM and 3G mobile
communication (green), etc. In order to save energy, power is shut down for unused
areas, causing these areas to be a special case of dark silicon (cf. p. 14). ∇

Fig. 3.37 Tensor processing unit (TPU), v1, for fast classification [277, 448]


Specialized processors are used since progress in semiconductor manufacturing 
and the design of new architectures is slowing down. Hence, specialized processors 
are needed to meet performance targets. This view is supported by the architecture 
which we will present next. 


Example 3.11 Around 2013, Google predicted that it would soon become very 
expensive to provide the expected pattern recognition performance in their data 
centers with conventional CPUs or GPUs. As a result, the design of specialized 
machine learning processors for fast classification with deep neural networks 
(DNNs) was started with a high priority. The resulting so-called Tensor Processing 
Unit (TPU) architecture is shown in Fig. 3.37. 

At the core of the architecture, there is a 256 by 256 array of MAC units. 
64k 8 bit MAC operations can be performed in a single cycle; 16 bit operations 
require more cycles. DNNs consist of layers of computations, where at each 
layer MAC operations involving weight factors are required. These are performed 
by “pumping” input data or data from intermediate layers through the MAC 
matrix. Each cycle, 256 result values become available. TPU version 1 outperforms
commonly used CPUs and GPUs by a factor of 29.2 and 13.3, respectively. The
performance/power ratio is improved by factors of 34 and 16, respectively. More
recently, Google announced second- and third-generation TPUs [93]. They also
support training DNNs. ∇
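The computation performed by the MAC array can be written down as a matrix-vector product. The Python sketch below uses a tiny array for readability and does not model how data is pumped through the hardware; it only shows that an n by n array performs n × n MAC operations and delivers n results per pass. The function name and the example values are assumptions made for illustration.

```python
def mac_array_pass(weights, inputs):
    """Return one result per column: sum over rows of inputs[row] * weights[row][col]."""
    size = len(weights)
    results = [0] * size                    # accumulators, wider than the 8 bit operands
    for row in range(size):
        for col in range(size):             # size * size MAC operations in total
            results[col] += inputs[row] * weights[row][col]
    return results


MACS_PER_CYCLE = 256 * 256                  # the array size stated in the example

print(mac_array_pass([[1, 2], [3, 4]], [5, 6]), MACS_PER_CYCLE)
```

With 256 rows and columns, the number of MAC units is 65,536, which corresponds to the 64k operations per cycle mentioned in the example.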


3.3.3 Reconfigurable Logic 


In many cases, full-custom hardware chips (ASICs) are too expensive, and software- 
based solutions are too slow or too energy-consuming. Reconfigurable logic 
provides a solution if algorithms can be efficiently implemented in custom hardware. 
It can be almost as fast as special-purpose hardware, but in contrast to special- 
purpose hardware, the performed function can be changed by using configuration 
data. Due to these properties, reconfigurable logic finds applications in the following 
areas: 


• Fast prototyping: Modern ASICs can be very complex and the design effort can
be large and take a long time. It is therefore frequently desirable to generate a
prototype, which can be used for experimenting with a system which behaves
“almost” like the final system. The prototype can be more costly and larger than
the final system. Also, its power consumption can be larger than the final system,
some timing constraints can be relaxed, and only the essential functions need
to be available. Such a system can then be used for checking the fundamental
behavior of the future system.

• Low-volume applications: If the expected market volume is too small to justify
the development of special-purpose ASICs, reconfigurable logic can be the right
hardware technology for applications, for which software would be too slow or
too inefficient.

• Real-time systems: The timing of reconfigurable logic-based designs is typically
known very precisely. Therefore, they can be used to implement timing-predictable
systems.

• Applications benefiting from a very high level of parallel processing: For
example, parallel searches for certain patterns can be implemented as parallel
hardware. Therefore, reconfigurable logic is employed in searches for genetic
information, for patterns in Internet messages, in stock data, in seismic analysis,
and more.


Reconfigurable hardware frequently includes random access memory (RAM) to 
store configurations. We distinguish between persistent and volatile configuration 
memory. For persistent memory, information is retained when power is shut off. 
For volatile memory, the information is lost once power is shut down. If the 
configuration memory is volatile, its content must be loaded from some persistent 
storage technology such as read-only memories (ROMs) or flash memories at 
startup. 
Field programmable gate arrays (FPGAs) are the most common form of
reconfigurable hardware. As the name indicates, such devices are programmable
“in the field” (after fabrication). Furthermore, they consist of arrays of processing
elements. As an example, Fig. 3.38 shows the column-based structure of the
Xilinx® UltraScale architecture [602].¹⁵ Some columns contain I/O interfaces,
clock devices, and/or RAM. Other columns comprise configurable logic blocks
(CLBs), special hardware for digital signal processing, and some RAM. CLBs
are the key components. They provide configurable functions. The architecture of
Xilinx® UltraScale CLBs is shown in Fig. 3.39 [599].

¹⁵ Rotation of this figure would improve its readability but would contradict the official designation
of this layout style.

In this architecture, each CLB contains eight blocks. Each block comprises a
RAM which is used to implement logic functions by a look-up table (LUT, shown
in red), two registers, multiplexers, and some additional logic. Each LUT has six
address inputs and two outputs. It can implement any single Boolean function of
six variables or two functions of five variables (provided that the two functions
share input variables). This means that all 2⁶⁴ functions of 6 variables or all 2³²
functions of 5 inputs can be implemented! This is the key means for achieving
configurability. In addition, the logic contained in such a block can also be
configured. This includes the control of the two registers, which can be programmed
to store results of the LUT or some direct input values. Blocks in a CLB can be
combined to form adders, multiplexers, shift registers, or memories. Configuration
data determines the setting of multiplexers in the CLBs, the clocking of registers
and RAM, the content of RAM components, and the connections between CLBs.
Some of the LUTs can also be used as RAM. A single CLB can store up to 512 bits.
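The way a LUT implements arbitrary Boolean functions can be illustrated in Python: the configuration is simply the truth table, stored as a 64 bit pattern, and the six inputs form the address of the bit to be read. The sketch also shows the split into two 5-input functions sharing their inputs. The function names and the bit ordering are assumptions made for illustration and do not describe the actual Xilinx® configuration format.

```python
def make_lut(function, inputs=6):
    """Compute the truth table (configuration bits) of a Boolean function."""
    table = 0
    for address in range(1 << inputs):
        bits = [(address >> i) & 1 for i in range(inputs)]
        if function(*bits):
            table |= 1 << address
    return table


def lut_eval(table, *bits):
    """Evaluate a LUT: the inputs select one bit of the stored table."""
    address = sum(bit << i for i, bit in enumerate(bits))
    return (table >> address) & 1


def dual_lut_eval(table, *bits):
    """Use a 64 bit table as two 32 bit tables sharing five inputs."""
    address = sum(bit << i for i, bit in enumerate(bits))
    return (table >> address) & 1, (table >> (32 + address)) & 1


parity = make_lut(lambda *b: sum(b) % 2)
print(lut_eval(parity, 1, 0, 0, 0, 0, 0), lut_eval(parity, 1, 1, 0, 0, 0, 0))
```

Since every 64 bit pattern is a valid configuration, the number of implementable functions of six variables equals the number of such patterns, i.e., 2⁶⁴.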

Several CLBs can be combined to create, for example, adders having a larger bit 
width, memories having a larger capacity, or complex logic functions. 

Currently available FPGAs comprise a large number of specialized blocks, 
like hardware for digital signal processing (DSP), some memory, high-speed I/O 
devices for various I/O standards, a decryption facility for FPGA configuration data, 
debugging support, ADCs, high-speed clocking, etc. 


Example 3.12 Virtex® UltraScale™ VU13P devices include 1728 k LUTs, 48 Mbit 
distributed RAM, 94.5 Mbit “Block RAM,” 360 Mbit “UltraRAM,” about 12k 
specialized DSP devices, 4 PCIe® devices, Ethernet interfaces, and up to 832 I/O 
pins [601]. ∇


Integration of reconfigurable computing with processors and software is simpli- 
fied if processors are available in the FPGAs. There may be either hard cores or 
soft cores. For hard cores, the layout contains a special area implementing a core in 
a dense way. This area cannot be used for anything but the hard core. Soft cores are 
available as synthesizable models which are mapped to standard CLBs. Soft cores 
are more flexible but less efficient than hard cores. Soft cores can be implemented 
on any FPGA chip. 


Example 3.13 The MicroBlaze processor [598] is an example of a soft core. ∇


Example 3.14 At the time of writing this book, hard cores are available, for
example, on Zynq UltraScale+ MPSoCs. They contain up to four ARM® Cortex-A53
cores, two ARM Cortex-R5 cores, and a Mali-400MP2 GPU processor [602]. ∇


Typically, configuration data is generated from a high-level description of the 
functionality of the hardware, for example, in VHDL. FPGA vendors provide the 
necessary design kits. Ideally, the same description could also be used for generating 
ASICs automatically. In practice, some interaction is required. Exploitation of the 
available parallelism typically requires manually parallelized applications, since 
automatic parallelization is frequently very limited. The parallelism offered by 
FPGAs is typically not fully exploited if all computations are mapped to processor 
cores. Overall, FPGAs allow implementing a huge variety of hardware devices 
without any need to create hardware other than FPGA boards. 


Example 3.15 Currently (in 2020), alternative providers of FPGAs include Altera®
(see http://www.altera.com, acquired by Intel®), Lattice Semiconductor (see
http://www.latticesemi.com), QuickLogic (see http://www.quicklogic.com), Microsemi
(formerly Actel; see http://www.microsemi.com), and Chinese vendors. ∇
3.4 Memories


3.4.1 Conflicting Goals 


Data, programs, and FPGA configurations must be stored in some kind of memory. 
Memories must have a capacity as large as required by the applications, provide 
the expected performance, and still be efficient in terms of cost, size, and energy 
consumption. Requirements for memories also include the expected reliability and 
access granularity (e.g., bytes, words, pages). Furthermore, we distinguish between 
persistent and volatile memory (see p. 165). The mentioned requirements are 
conflicting, as has already been observed by Burks, Goldstine, and von Neumann 
in 1946 [78]: 

“Ideally one would desire an indefinitely large memory capacity such that any 
particular ... word ... would be immediately available — i.e. in a time which is 
... shorter than the operation time of a fast electronic multiplier. ... It does not seem
possible physically to achieve such a capacity.” 

Access times of some currently available memories can be estimated with CACTI. 
These estimates are based on the tentative generation of a memory layout and the 
extraction of capacitances [589]. Many different parameters enable the selection of 
an appropriate fabrication technology.¹⁶


Example 3.16 Figure 3.40 shows the results for a range of exponentially increasing 
sizes [36]. Obviously, the access time increases as a function of the capacity of 
memories: the larger the memory, the longer it takes to access information. In 
addition, Fig. 3.40 also includes the energy consumption. Large memories also tend 
to be energy-inefficient. The impact of the capacity of the memory on the energy 
consumption is even larger than the impact on the access time. ∇


¹⁶ In fact, it is frequently difficult to select the right parameters.
For a number of years, the difference in speeds between processors and memories
increased (see Fig. 3.41) until processor clock rates saturated (around 2003). While
the speed of memories increased by only a factor of about 1.07 per year, overall
processor performance increased by a factor of 1.5 to 2 per year [358]. Overall, the gap
between processor performance and memory speeds has become large. Accordingly,
memory access times make a further increase of the overall performance very
difficult, to say the least. This fact has also been called the memory wall [358].
The increase of clock rates of single processors has come to a standstill, but the
large gap that existed when clock speeds essentially saturated remains,
and multi-cores require additional memory bandwidth. As a result, we have to find
compromises between the different requirements for the memory architecture.
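The effect of the different growth rates can be estimated with a short calculation. The Python sketch below compounds the annual factors quoted above; the function name and the chosen time span of ten years are assumptions made for illustration, and the result is a rough estimate rather than measured data.

```python
def performance_gap(years, processor_growth=1.5, memory_growth=1.07):
    """Ratio of processor speed-up to memory speed-up after the given years."""
    return (processor_growth / memory_growth) ** years


for years in (1, 5, 10):
    print(years, round(performance_gap(years), 1))
```

Even with the lower bound of 1.5 per year for processors, the gap grows by roughly 40% each year, so that a factor of almost 30 accumulates within a decade.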


3.4.2 Memory Hierarchies 


Due to the observed conflicts, Burks, Goldstine, and von Neumann wrote already 
in 1946 [78]: “We are therefore forced to recognize the possibility of constructing a 
hierarchy of memories, each of which has greater capacity than the preceding but 
which is less quickly accessible.” 

The exact structure of the hierarchy depends on technological parameters and 
also on the application area. Typically, we can identify at least the following levels 
in the memory hierarchy: 


• Processor registers can be seen as the fastest level in the memory hierarchy, with
only a limited capacity of at most a few hundred words.

• The working memory (or main memory) of computer systems implements
the storage implied by processor memory addresses. Usually it has a capacity
between a few megabytes and some gigabytes and is volatile.

• Typically there is a large access speed difference between the main memory and
registers. Hence, many systems include some type of buffer memory. Frequently
used buffer memories include caches, translation look-aside buffers (TLBs;
see Appendix C), and scratchpad memory (SPM). In contrast to PC-like
systems and compute servers, the architecture of these small memories should
guarantee a predictable real-time performance. A combination of small memories
containing frequently used data and instructions and a larger memory containing
the remaining data and instructions is generally also more energy efficient than a
single, large memory.

• Memories introduced so far are normally implemented in volatile memory
technologies. In order to provide persistent storage, some different memory
technology must be used. For embedded systems, flash memory is frequently
the best solution. In other cases, hard disks or Internet-based storage solutions
(like the “cloud”) may be used.


Memory hierarchies can be exploited in order to achieve a compromise between 
the design goals for the memory. Memory partitioning has been considered, for 
example, by A. Macii [360]. New memory technologies (including persistent 
memories) have the potential to change currently dominating hierarchies [388]. 


3.4.3 Register Files 


The mentioned impact of the storage capacity on access times and energy consump- 
tion applies even to small memories such as register files. Figure 3.42 shows the 
cycle time and the power as a function of the size of memories used as register files 
[471]. The power needs to be considered due to frequent accesses to registers, as a 
result of which they can get very hot. 


3.4.4 Caches 


For caches it is required that the hardware checks whether or not the cache has a 
valid copy of the information associated with a certain address. This check involves 
comparing the tag fields of caches, containing a subset of the relevant address bits 
[211]. If the cache has no valid copy, the information in the cache is automatically 
updated.

Caches were initially introduced in order to provide good run-time efficiency.
The name is derived from the French word cacher (to hide), indicating that 
programmers do not need to see or to be aware of caches, since updating information 
in caches is automatic. However, when large amounts of information need to be 
accessed, caches are not so invisible anymore. This has been demonstrated very 
nicely by Drepper [139]. Drepper analyzed execution times of a program traversing 
a linear list of entries. Each entry contained one 64 bit pointer to the next entry plus 
NPAD 64 bit words. Execution times were measured for a Pentium P4 processor 
comprising a 16 kB level 1 cache requiring 4 processor cycles per access, a 1 MB 
level 2 cache requiring 14 processor cycles per access, and a main memory requiring 
200 cycles per access. Figure 3.43 shows the average number of cycles per access 
to one list element as a function of the total size of the list for the case NPAD=0. 
For small sizes of the list, four cycles are required per list element. This means 
that we are almost always accessing the level 1 cache, since it is large enough for 
this size of the list. If we increase the size of the list, we need eight cycles per 
access on average. In this case, we are accessing the level 2 cache. However, since 
the cache block size is large enough to hold two list elements, only every second 
access is actually an access to the level 2 cache. For even larger lists, the access time 
increases to nine cycles. In these cases, the list is larger than the level 2 cache, but 
automatic prefetching of level 2 cache entries hides some of the access latency of 
the main memory. 

Figure 3.44 shows the average number of cycles per access to one list element as 
a function of the total size of the list for cases NPAD=0, 7, 15, and 31. For NPAD=7, 
15, and 31, prefetching fails due to the larger size of list items. Obviously, we see 
a dramatic increase of access times. This means that the cache architecture has a 
strong impact on the execution times of applications. Increasing cache size will
only change the size of the application at which this increase in execution times
happens. Clever exploitation of hierarchies can have a large impact on execution
times.

Fig. 3.44 Average number of cycles per access for NPAD=0, 7, 15, 31
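
The list sizes at which the curves change can be related to the cache capacities with a little arithmetic. The sketch below computes the size of one list entry (one 64-bit pointer plus NPAD 64-bit words) and the number of entries that fit into the two caches mentioned above. It assumes that 1 kB equals 1024 bytes and ignores cache organization and all other data competing for the caches; the function names are made up for this example.

```python
WORD_BYTES = 8  # one 64-bit word


def element_bytes(npad):
    """Size of one list entry: one 64-bit pointer plus NPAD 64-bit words."""
    return WORD_BYTES * (1 + npad)


def elements_fitting(cache_bytes, npad):
    """Number of whole list entries that fit into the given capacity."""
    return cache_bytes // element_bytes(npad)


L1_BYTES = 16 * 1024    # 16 kB level 1 cache
L2_BYTES = 1024 * 1024  # 1 MB level 2 cache

for npad in (0, 7, 15, 31):
    print(npad, element_bytes(npad),
          elements_fitting(L1_BYTES, npad),
          elements_fitting(L2_BYTES, npad))
```

For NPAD=31, for example, one entry occupies 256 bytes, so a list with more than 64 entries can no longer be held in the level 1 cache under these assumptions.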

So far, we have just looked at the impact of capacity on access times. In the 
context of Fig. 3.40 however, it is obvious that caches potentially also improve the 
energy efficiency of a memory system. Accesses to caches are accesses to small 
memories and therefore require less energy per access than large memories. 

Predicting cache misses and hits at design time is difficult, which complicates the
accurate prediction of real-time performance (see p. 246).


3.4.5 Scratchpad Memories 


Alternatively, small memories can be mapped into the address space (see Fig. 3.45).
Such memories are called scratchpad memories (SPMs) or tightly coupled
memories (TCM). SPMs are accessed by a proper selection of memory addresses.
There is no need for checking tags, as for caches. Instead, the SPM is accessed
whenever some simple address decoder signals that an address is in the address
range of the SPM. SPMs are typically integrated together with processors on the
same die. Hence, they are a special case of on-chip memories. For n-way set
associative caches, reads usually access n entries in parallel and select the right
entry only afterward. These energy-hungry parallel reads are avoided for SPMs. As
a result, SPMs are very energy-efficient.
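
The difference to a cache can be illustrated with a toy model of the address decoder. The sketch below is a simplification under stated assumptions: the class name MemorySystem, the base address, and the sizes are hypothetical, and main memory is represented by a dictionary. The point is only that the SPM is selected by a range check on the address, without any tag comparison.

```python
class MemorySystem:
    """Toy model of a scratchpad mapped into part of the address space."""

    def __init__(self, spm_base, spm_size):
        self.spm_base = spm_base
        self.spm_size = spm_size
        self.spm = [0] * spm_size  # small on-chip memory
        self.main = {}             # sparse stand-in for main memory

    def in_spm(self, addr):
        # Simple address decoder: a range check, no tags involved.
        return self.spm_base <= addr < self.spm_base + self.spm_size

    def write(self, addr, value):
        if self.in_spm(addr):
            self.spm[addr - self.spm_base] = value
        else:
            self.main[addr] = value

    def read(self, addr):
        if self.in_spm(addr):
            return self.spm[addr - self.spm_base]
        return self.main.get(addr, 0)


mem = MemorySystem(spm_base=0x1000, spm_size=256)
mem.write(0x1004, 42)  # lands in the SPM
mem.write(0x2000, 7)   # lands in main memory
print(mem.in_spm(0x1004), mem.in_spm(0x2000))
```

Because the decision depends on the address alone, it is known at design time which accesses go to the SPM.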

Figure 3.46 shows a comparison between the energy required per access to the 
scratchpad (SPM) and the energy required per access to the cache. 

For a two-way set associative cache, the two values differ by a factor of about 
three. The values in this example were computed using the energy consumption for 
RAM arrays as estimated by CACTI [589]. A detailed comparison between figures 
of merit for caches and scratchpads was published by Banakar et al. [36]. 

Frequently used variables and instructions should be allocated to the address 
space of SPMs. SPMs can improve the memory access times very predictably if 
the compiler is in charge of keeping frequently used variables in the SPM (see p. 
363). 
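
One possible way of deciding which objects to place in the SPM is to treat the decision as a knapsack problem: each object has a size and an access count, and the SPM capacity must not be exceeded. This formulation is an assumption of the following sketch and is not claimed to be the method referenced above; the object names, sizes, and access counts are invented.

```python
def allocate_to_spm(objects, capacity):
    """Choose objects for the SPM so that the number of SPM accesses is maximal.

    objects  -- list of (name, size_in_bytes, access_count) tuples
    capacity -- SPM capacity in bytes
    Returns (total_accesses, frozenset_of_names).
    """
    # best[c] holds the best selection found so far that fits into c bytes.
    best = [(0, frozenset())] * (capacity + 1)
    for name, size, accesses in objects:
        # Walk capacities downwards so that each object is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + accesses
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] | {name})
    return best[capacity]


# Hypothetical profile: (name, size in bytes, number of accesses)
profile = [("a", 4, 100), ("b", 3, 60), ("c", 3, 60), ("d", 2, 10)]
total, chosen = allocate_to_spm(profile, capacity=6)
print(total, sorted(chosen))
```

In this invented profile, picking the single most frequently accessed object first would not be optimal, which is why a simple greedy choice is not used here.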


3.5 Communication 


Information must be communicated before it can be processed in an embedded 
system. Communication is particularly important for the Internet of Things. Infor- 
mation can be communicated through various channels. Channels are abstract 
entities characterized by the essential properties of communication, like maximum 
information transfer capacity and noise parameters. The probability of communica- 
tion errors can be computed using communication theory techniques. The physical 
entities enabling communication are called communication media. Important media 
classes include wireless media (radio frequency media, infrared), optical media 
(fibers), and wires. 

There is a huge variety of communication requirements between the various 
classes of embedded systems. In general, connecting the different embedded 
hardware components is far from trivial. Some common requirements can be 
identified.


3.5.1 Requirements


The following list contains some of the requirements that must be met: 


Real-time behavior: This requirement has far-reaching consequences on the 
design of the communication system. Several low-cost solutions such as standard 
Ethernet fail to meet this requirement. 

Efficiency: Connecting different hardware components can be expensive. For 
example, point-to-point connections in large buildings are almost impossible. 
Also, it has been found that separate wires between control units and external 
devices in cars significantly add to the cost and the weight of the car. With 
separate wires, it is also difficult to add new components. The need for cost 
efficiency also affects the way in which power is made available to external 
devices. There is frequently the need to use a central power supply to reduce 
the cost. 

Appropriate bandwidth and communication delay: Bandwidth requirements 
of embedded systems may vary. It is important to provide sufficient bandwidth 
without making the communication system too expensive. 

Support for event-driven communication: Polling-based systems provide a
very predictable real-time behavior. However, their communication delay may
be too large, and there should be mechanisms for fast, event-oriented
communication. For example, emergency situations should be communicated immediately
and should not remain unnoticed until some central controller polls for messages.

Security/privacy: Ensuring security/privacy of confidential information
(confidentiality) may require the use of encryption.

Safety/robustness: For safety-critical systems, the required level of safety must
be achieved. This includes robustness: cyber-physical systems may be used at
extreme temperatures, close to major sources of electromagnetic radiation, etc.
Car engines, for example, can be exposed to temperatures ranging from below −20 °C
up to +180 °C (−4 °F to 356 °F). Voltage levels and clock frequencies could be
affected due to this large variation in temperatures. Still, reliable communication
must be maintained.

Fault tolerance: Despite all the efforts for robustness, faults may occur. Cyber- 
physical systems should be operational even after faults, if at all feasible. 
Restarts, like the ones found in PCs, cannot be accepted. This means that retries 
may be required after attempts to communicate failed. A conflict exists with 
the first requirement: if we allow retries, then it is difficult to meet real-time 
requirements. 

Maintainability, diagnosability: Obviously, it should be possible to repair 
embedded systems within reasonable time frames. 


These communication requirements are a direct consequence of the general
characteristics of embedded/cyber-physical systems mentioned in Chap. 1. Due to
the conflicts between some of the requirements, compromises must be made. For
example, there may be different communication modes: one high-bandwidth mode
guaranteeing real-time behavior but no fault tolerance (this mode is appropriate
for multimedia streams) and a second fault-tolerant, low-bandwidth mode for short
messages that must not be dropped.

Fig. 3.48 Differential signaling


3.5.2 Electrical Robustness 


There are some basic techniques for electrical robustness. Digital communication
within chips normally uses so-called single-ended signaling. For single-ended
signaling, signals are propagated on a single wire (see Fig. 3.47).

Such signals are represented by voltages with respect to a common ground (less
frequently by currents). A single ground wire is sufficient for a number of
single-ended signals. Single-ended signaling is highly susceptible to external noise.
If external noise (originating from, e.g., motors being switched on) affects the
voltage, messages can easily be corrupted. Also, it is difficult to establish
high-quality common ground signals between a large number of communicating systems,
due to the resistance (and self-inductance) on the ground wires. This is different for
differential signaling. For differential signaling, each signal needs two wires (see
Fig. 3.48).

Using differential signaling, binary values are encoded as follows: if the voltage
on the first wire with respect to the second is positive, then this is decoded as '1';
otherwise values are decoded as '0'. The two wires will typically be twisted to form
so-called twisted pairs. There will be local ground signals, but a non-zero voltage
between the local ground signals does not hurt. Advantages of differential signaling
include the following:


• Noise is added to the two wires in essentially the same way. The comparator
therefore removes almost all the noise.

• The logic value depends just on the polarity of the voltage between the two wires.
The magnitude of the voltage can be affected by reflections or because of the
resistance of the wires; this has no effect on the decoded value.

• Signals do not generate any currents on the ground wires. Hence, the quality of
the ground wires becomes less important.

• No common ground wire is required. Hence, there is no need to establish a
high-quality ground wiring between a large number of communicating partners.

• As a consequence of the properties mentioned so far, differential signaling allows
a larger throughput than single-ended signaling.

Fig. 3.49 TDMA-based communication


However, differential signaling requires two wires for every signal, and it also 
requires negative voltages (unless it is based on complementary logic signals using 
voltages for single-ended signals). Differential signaling is used, for example, in 
standard Ethernet-based networks and the universal serial bus (USB). 
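
The noise immunity described in this section can be illustrated with a few lines of code. In the sketch below, the same disturbance is added to both wires of a differential pair, and the decoder evaluates only the polarity of the voltage between them. The voltage levels, the threshold of the single-ended decoder, and the function names are assumptions chosen for this illustration.

```python
def decode_single_ended(v_signal, threshold=0.5):
    """Decode a voltage measured against a common ground."""
    return 1 if v_signal > threshold else 0


def transmit_differential(bit, swing=1.0):
    """Drive the two wires with opposite polarity (levels are assumed)."""
    half = swing / 2
    return (half, -half) if bit else (-half, half)


def decode_differential(v_first, v_second):
    """Decode from the polarity of the voltage between the two wires."""
    return 1 if v_first - v_second > 0 else 0


noise = 0.8  # disturbance that affects both wires in the same way

for bit in (0, 1):
    first, second = transmit_differential(bit)
    received = decode_differential(first + noise, second + noise)
    print("differential:", bit, "->", received)

# Single-ended: a '0' sent as 0.0 V exceeds the threshold once noise is added.
print("single-ended:", 0, "->", decode_single_ended(0.0 + noise))
```

The common disturbance cancels in the difference of the two wire voltages, whereas it directly shifts the single-ended level across the decision threshold.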


3.5.3 Guaranteeing Real-Time Behavior 


For internal communication, computers may be using dedicated point-to-point 
communication or shared buses. Point-to-point communication can have a good 
real-time behavior but requires many connections, and there may be congestion 
at the receivers. Wiring is easier with common, shared buses. Typically, such 
buses use priority-based arbitration if several access requests to the communication 
media exist (see, e.g., [211]). Priority-based arbitration comes with poor timing 
predictability, since conflicts are difficult to anticipate at design time. Priority- 
based schemes can even lead to “starvation” (low-priority communication can be 
completely blocked by higher-priority communication). In order to get around 
this problem, time division multiple access (TDMA) can be used. In a TDMA 
scheme, each partner is assigned a fixed time slot. The partner is only allowed to 
transmit during that particular time slot. Typically, communication time is divided 
into frames. Each frame starts with some time slot for frame synchronization and 
possibly some gap to allow the sender to turn off (see Fig. 3.49, [302]).
This gap is followed by a number of slices, each of which serves for
communicating messages. Each slice also contains some gap and guard time to take clock
speed variations of the partners into account. Slices are assigned to communication 
partners. Variations of this scheme exist. For example, truncation of unused slices 
or the assignment of partners to several slices are feasible. TDMA reduces the 
maximum amount of data available per frame and partner but guarantees a certain 
bandwidth for all partners. Starvation can be avoided. The ARM AMBA bus [21] 
includes TDMA-based bus allocation. 
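
A minimal sketch of the basic TDMA idea is shown below. It assumes a frame that consists of one synchronization slot followed by equally long slices, one per entry in a list of owners; gaps and guard times are ignored, and all names and numbers are hypothetical.

```python
def slot_owner(t, sync_len, slice_len, owners):
    """Return the partner allowed to transmit at time t.

    Returns None while the frame synchronization slot is active.
    """
    frame_len = sync_len + slice_len * len(owners)
    offset = t % frame_len
    if offset < sync_len:
        return None
    return owners[int((offset - sync_len) // slice_len)]


def guaranteed_share(partner, sync_len, slice_len, owners):
    """Fraction of the total time that is reserved for one partner."""
    frame_len = sync_len + slice_len * len(owners)
    return owners.count(partner) * slice_len / frame_len


owners = ["A", "B", "C", "A"]  # partner A is assigned two slices per frame
for t in (0, 2, 6, 10, 14, 18):
    print(t, slot_owner(t, sync_len=2, slice_len=4, owners=owners))
print(guaranteed_share("A", sync_len=2, slice_len=4, owners=owners))
```

Since the owner of every point in time is fixed in advance, each partner has a guaranteed share of the bandwidth regardless of what the other partners do.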

Communication between computers is frequently based on Ethernet standards. 
For 10 and 100 Mbit/s versions of Ethernet, there can be collisions between various 
communication partners. This means several partners are trying to communicate 
at about the same time and the signals on the wires are corrupted. Whenever this 
occurs, the partners must stop communications, wait for some time, and then retry. 
The waiting time is chosen at random, so that it is not very likely that the next 
attempt to communicate results in another collision. This method is called carrier- 
sense multiple access with collision detection (CSMA/CD). For CSMA/CD, 
communication time can become huge, since conflicts can repeat a large number 
of times, even though this is not very likely. Hence, CSMA/CD cannot be used 
when real-time constraints must be met. 
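
The lack of an upper bound can be made tangible with a small simulation. The text above only states that the waiting time is chosen at random; the sketch below additionally assumes the truncated binary exponential backoff of classical Ethernet (the range of waiting times doubles after each collision, up to 2^10 slot times, and a sender gives up after 16 attempts). The model is simplified: an attempt counts as successful as soon as exactly one partner has picked the shortest waiting time. The function name is made up for this example.

```python
import random


def attempts_until_success(n_partners=2, max_attempts=16, rng=None):
    """Number of retries until one partner gets through, or None on give-up."""
    rng = rng or random.Random()
    for attempt in range(1, max_attempts + 1):
        window = 2 ** min(attempt, 10)  # range of possible waiting times
        waits = [rng.randrange(window) for _ in range(n_partners)]
        # A unique earliest sender transmits; the others sense the carrier.
        if waits.count(min(waits)) == 1:
            return attempt
    return None


rng = random.Random(2020)  # fixed seed for repeatable runs
results = [attempts_until_success(n_partners=4, rng=rng) for _ in range(1000)]
successful = [r for r in results if r is not None]
print("largest number of retries observed:", max(successful))
```

Most runs succeed after very few retries, but no run is guaranteed to succeed within a fixed number of attempts, which is exactly what makes the scheme unsuitable for hard real-time constraints.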

This problem can be solved with CSMA/CA (carrier-sense multiple access 
with collision avoidance). As the name indicates, collisions are completely 
avoided, rather than just detected. For CSMA/CA, priorities are assigned to all 
partners. Communication media are allocated to communication partners during 
arbitration phases, which follow communication phases. During arbitration 
phases, partners wanting to communicate indicate this on the media. Partners finding 
such indications of higher priority must immediately remove their indication. 

Provided that there is an upper bound on the time between arbitration phases, 
CSMA/CA guarantees a predictable real-time behavior for the partner having the 
highest priority. For other partners, real-time behavior can be guaranteed if the 
higher priority partners do not continuously request access to the media. 
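
One common way of realizing such an arbitration phase is bitwise arbitration on a wired-AND medium, as used by CAN. The sketch below assumes this particular scheme, which the text above does not prescribe: every partner sends its identifier bit by bit, a '0' dominates the bus, and a partner that sends '1' but observes '0' withdraws. Smaller identifiers therefore have higher priority. The identifier width of 11 bits and the function name are assumptions of this example.

```python
def bitwise_arbitration(ids, width=11):
    """Return the identifier that wins arbitration, or None without requests.

    ids   -- distinct identifiers (each below 2**width) of requesting partners
    width -- number of identifier bits, sent most significant bit first
    """
    contenders = set(ids)
    if not contenders:
        return None
    for bit in range(width - 1, -1, -1):
        sent = {i: (i >> bit) & 1 for i in contenders}
        bus = min(sent.values())  # a single '0' dominates the bus
        # Partners that sent '1' but observe '0' remove their indication.
        contenders = {i for i in contenders if sent[i] == bus}
    return min(contenders)


print(hex(bitwise_arbitration([0x65, 0x123, 0x64])))
```

The partner with the smallest identifier always wins. If it requests the medium in every arbitration phase, the other partners never get access, which is the reason for the condition stated in the preceding paragraph.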

Note that high-speed versions of Ethernet (>1 Gbit/s) also avoid collisions. 
TDMA schemes are also used for wireless communication. For example, mobile 
phone standards like GSM use TDMA for accesses to the communication medium. 


3.5.4 Examples 


• Sensor/actuator buses: Sensor/actuator buses provide communication between
simple devices such as switches or lamps and the processing equipment. There
may be many such devices and the cost of the wiring needs special attention for
such buses.

• Field buses: Field buses are similar to sensor/actuator buses. In general, they
are supposed to support larger data rates than sensor/actuator buses. Examples of
field buses include the following:

— Controller Area Network (CAN): This bus was developed in 1981 by
Bosch and Intel for connecting controllers and peripherals. It is popular in 
the automotive industry, since it allows the replacement of a large amount 
of wires by a single bus. Due to the size of the automotive market, CAN 
components are relatively cheap and are therefore also used in other areas 
such as smart homes and fabrication equipment. CAN is based on differential 
signaling and arbitration using CSMA/CA. The encoding of signals is similar 
to that of serial (RS-232) lines of early PCs, with modifications for differential 
signaling. CSMA/CA-based arbitration does not prevent starvation. This is an 
inherent problem of the CAN protocol. Extensions exist. 

— The Time-Triggered Protocol (TTP) [304]: This is a protocol for fault- 
tolerant safety systems like airbags in cars. 

— FlexRay™ [253]: This is a TDMA protocol which has been developed by 
the FlexRay consortium (BMW, Daimler AG, General Motors, Ford, Bosch, 
Motorola, and Philips Semiconductors). 

FlexRay includes a static as well as a dynamic arbitration phase. The static 
phase uses a TDMA-like arbitration scheme. It can be used for real-time com- 
munication and starvation can be avoided. The dynamic phase provides a good 
bandwidth for non-real-time communication. Communicating partners can be 
connected to up to two buses for fault-tolerance reasons. Bus guardians may 
protect partners against partners flooding the bus with redundant messages, 
so-called babbling idiots. Partners may use their own local clock periods. 
Periods common to all partners are defined as multiples of such local clock 
periods. Time slots allocated to partners for communication are based on these 
common periods. 

The levi simulation allows simulating the protocol in a lab environment 
[495]. 

— LIN (Local Interconnect Network): This is a low-cost communication stan- 
dard for connecting sensors and actuators in the automotive domain [346]. 


— MAP: MAP is a bus designed for car factories. 
— EIB: The European Installation Bus (EIB) is a bus designed for smart homes. 


• The Inter-Integrated Circuit (I²C) bus: This is a simple low-cost bus designed
to communicate at short distances (meter range) with relatively low data rates.
The bus needs only four wires: ground, SCL (clock), SDA (data), and a voltage
supply line. Data and clock lines are open collector lines (see pp. 89-91). This
means that connected devices pull these lines only toward ground. Separate
resistors are needed to pull these lines up. The standard speed of I²C is 100 kb/s,
but versions for 10 kb/s and up to 3.4 Mb/s also exist. The voltage on the
supply voltage line may vary between interfaces. Only the standards for detecting
high and low logic levels are defined relative to the supply voltage. The bus is
supported on some micro-controller boards.

• Wired multimedia communication: For wired multimedia communication,
larger data rates are required. For example, MOST (Media Oriented Systems
Transport) is a communication standard for multimedia and infotainment
equipment in the automotive domain [402]. Standards like IEEE 1394 (FireWire)
may be used for the same purpose.

• Wireless communication: This kind of communication is becoming more
popular. Standards for wireless communication include the following:

— Mobile communication is becoming available at increased data rates. Data
rates of 7 Mbit/s are obtained with HSPA (High Speed Packet Access). About ten
times higher rates are available with long-term evolution (LTE). 5G networks are
expected to provide data rates between 50 Mbit/s and more than a gigabit/s,
with latencies less than those of earlier networks.

— Bluetooth is a standard for connecting devices such as mobile phones and
their headsets over short distances.

— Wireless local area networks (WLANs) are standardized as IEEE standard
802.11, with several supplementary standards.

— ZigBee (see http://www.zigbee.org) is a communication protocol designed to
create personal area networks using low-power radios. Applications include
home automation and the Internet of Things.

— Digital European cordless telecommunications (DECT) is a standard used
for wireless phones. It is being used throughout the world, except for different
frequencies used in North America (see
https://en.wikipedia.org/wiki/Digital_Enhanced_Cordless_Telecommunications).


3.6 Output: Interface Between Cyber and Physical World


Output devices are key components of the cyphy-interface. Examples include:


Displays: Display technology is an area which is extremely important.
Accordingly, a large amount of information [503] exists on this technology. Major
research and development efforts lead to new display technology such as organic
displays [342]. Organic displays emit light and can be fabricated with
very high densities. In contrast to LCDs, they do not need backlight and
polarizing filters. Major changes are therefore expected in these markets.

Electro-mechanical devices: These influence the environment through motors
and other electro-mechanical equipment.


Analog as well as digital output devices are used. In the case of analog
output devices, the digital information must first be converted by digital-to-analog
converters (DACs). These converters can be found on the path from analog inputs
of embedded systems to their outputs. Figure 3.50 shows the naming convention
of signals along the path which we use. Purpose and function of the boxes will be
explained in this section.


3.6.1 Digital-to-Analog Converters


Digital-to-analog converters (DACs) are also included in the cyphy-interface. They 
are not very complex. Figure 3.51 shows the schematic of a simple so-called 
weighted-resistor DAC. 

The key idea of the converter is to first generate a current which is proportional 
to the value represented by a digital signal x. Such a current can hardly be used by 
a following system. Therefore, this current is converted into a proportional voltage 
y. This conversion is done with an operational amplifier (depicted by a triangle 
in Fig. 3.51). Essential characteristics of operational amplifiers are described in 
Appendix B of this book. 

How do we compute the output voltage y? Consider the four resistors on the left
in Fig. 3.51. The current through any resistor is zero if the corresponding element of
digital signal x is '0'. If it is '1', the current corresponds to the weight of that bit,
since resistor values are chosen accordingly. Now, consider the loop indicated by the
red dashed line in Fig. 3.51. We can apply Kirchhoff’s loop rule (see Appendix B) to
the loop turned on by the least significant bit $x_0$ of x. Let us start the loop traversal at
the corresponding resistor and continue in a clockwise fashion. The first term is
the voltage across this resistor. The second term is
the voltage $V_-$ between the inputs of the operational amplifier, counted as positive,
since we proceed in the direction of the arrow. The third term is contributed by the
constant voltage source, counted as negative, since we proceed against the direction
of the arrow. Overall, we have

$$x_0 \cdot I_0 \cdot 8 \cdot R + V_- - V_{ref} = 0 \tag{3.22}$$

$V_-$ is approximately 0 (see Appendix B, Eq. (B.14)). Therefore, we have

$$I_0 = x_0 \cdot \frac{V_{ref}}{8 \cdot R} \tag{3.23}$$

Corresponding equations hold for the currents $I_1$ to $I_3$ through the other resistors.
We can now apply Kirchhoff’s node rule to the circuit node connecting all resistors. 
At this node, the outgoing current must be equal to the sum of the incoming currents. 
Therefore, we have 


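$$I = I_3 + I_2 + I_1 + I_0 \tag{3.24}$$

$$I = x_3 \cdot \frac{V_{ref}}{R} + x_2 \cdot \frac{V_{ref}}{2 \cdot R} + x_1 \cdot \frac{V_{ref}}{4 \cdot R} + x_0 \cdot \frac{V_{ref}}{8 \cdot R} = \frac{V_{ref}}{R} \cdot \sum_{i=0}^{3} x_i \cdot 2^{i-3} \tag{3.25}$$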


Now, we can apply Kirchhoff’s loop rule to the loop comprising $R_1$, y, and $V_-$.
Since $V_-$ is approximately 0, we have

$$y + R_1 \cdot I' = 0 \tag{3.26}$$

Next, we can apply Kirchhoff’s node rule to the node connecting $I$, $I'$, and the
inverting signal input of the operational amplifier. The current into this input is
practically zero, and currents $I$ and $I'$ are equal: $I = I'$. Hence, we have

$$y + R_1 \cdot I = 0 \tag{3.27}$$

From Eqs. (3.25) and (3.27), we obtain

$$y = -V_{ref} \cdot \frac{R_1}{R} \cdot \sum_{i=0}^{3} x_i \cdot 2^{i-3} = -V_{ref} \cdot \frac{R_1}{8 \cdot R} \cdot \mathrm{nat}(x) \tag{3.28}$$


nat denotes the natural number represented by digital signal x. Obviously, y is 
proportional to the value represented by x. Positive output voltages and bit vectors 
representing two’s complement numbers require minor extensions. 
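
The derivation can be checked numerically. The following sketch computes the output voltage once via the sum of the resistor currents (Eqs. (3.25) and (3.27)) and once via the closed form of Eq. (3.28). It assumes an ideal operational amplifier, as in the derivation above; the component values and function names are chosen arbitrarily for illustration.

```python
def nat(bits):
    """Natural number represented by the bit vector (x3, x2, x1, x0)."""
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value


def dac_output(bits, v_ref, r, r1):
    """Output voltage y of the weighted-resistor DAC.

    bits -- (x3, x2, x1, x0), most significant bit first
    The resistor of bit x3 has the value R, that of x2 has 2R, and so on.
    """
    # Eq. (3.25): sum of the currents through the switched-on resistors.
    current = sum(x * v_ref / (2 ** k * r) for k, x in enumerate(bits))
    # Eq. (3.27): y + R1 * I = 0
    return -r1 * current


def dac_closed_form(bits, v_ref, r, r1):
    """Eq. (3.28): y = -V_ref * R1 / (8 * R) * nat(x)."""
    return -v_ref * r1 / (8 * r) * nat(bits)


x = (1, 0, 1, 0)  # represents the natural number 10
print(dac_output(x, v_ref=8.0, r=1000.0, r1=1000.0))
print(dac_closed_form(x, v_ref=8.0, r=1000.0, r1=1000.0))
```

Both functions yield the same voltage up to floating-point rounding, and the magnitude of the output grows linearly with nat(x).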

From a DSP point of view, y(t) is a function over a discrete time domain: it 
provides us with a sequence of voltage levels. In our running example, it is defined 
only over integer times. From a practical point of view, this is inconvenient, since 
we would typically observe the output of the circuit of Fig.3.51 continuously. 
Therefore, DACs are frequently extended by a “zero-order hold” functionality. 
This means that the converter will keep the previous value until the next value is 
converted. Actually, the DAC of Fig. 3.51 will do exactly this if we do not change 
the settings of the switches until the next discrete time instant. Hence, the output of
the converter is a step function y’(t) corresponding to the sequence y(t) (see footnote 17). y’(t) is
a function over the continuous time domain.

As an example, let us consider the output resulting from the conversion of the 
signal of Eq. (3.3), assuming a resolution of 0.125. For this case, Fig. 3.52 shows 
y’(t) instead of y(t), since y’(t) is a bit easier to visualize. 

DACs enable a conversion from time- and value-discrete signals to signals in 
the continuous time and value domain. However, neither y(t) nor y’(t) reflects the 
values of the input signal in between the sampling instances. 
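
The zero-order hold behavior can be stated in a few lines. The sketch below assumes that the samples were converted at the integer times 0, 1, 2, ..., as in the running example, and that the output simply keeps the first or last value outside the sampled range; the function name and the sample values are made up.

```python
import math


def zero_order_hold(samples, t):
    """Return y'(t) for samples y(0), y(1), ... converted at integer times."""
    if not samples:
        raise ValueError("at least one sample is required")
    index = math.floor(t)  # most recent sampling instance
    index = max(0, min(index, len(samples) - 1))  # clamp outside the range
    return samples[index]


y = [0.0, 0.5, 0.25, -0.125]  # hypothetical sequence of converted values
for t in (0.0, 0.9, 1.0, 1.7, 2.5):
    print(t, zero_order_hold(y, t))
```

Between two sampling instances the output does not change, so the step function carries no information about the input signal in between.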


3.6.2 Sampling Theorem 


Suppose that the processors used in the hardware loop forward values from ADCs 
unchanged to the DACs. We could also think of storing values x(t) on a CD and 
aiming at generating an excellent analog audio signal. Would it be possible to 
reconstruct the original analog voltage e(t) (see Figs. 3.8, 3.21, and 3.50) at the 
outputs of the DACs? 

It is obvious that reconstruction is not possible if we have aliasing of the type
described in Fig. 3.7 on p. 134 (see footnote 18). So, we assume that the sampling rate is larger
than twice the highest frequency of the decomposition of the input signal into sine
waves (sampling criterion; see Eq. (3.8)). Does meeting this criterion allow us to
reconstruct the original signal? Let us have a closer look!

17 In practice, due to rise and fall times being > 0, transitions from one step to the next will not be
ideal, but take some time.

18 Reconstruction may be possible if additional information about the signal is available, e.g., if we
restrict ourselves to certain signal types.

Feeding DACs with a discrete sequence of digital values will result in a 
sequence of analog values being generated. Values of the input signal in between 
the sampling instances are not generated by DACs. The simple zero-order hold 
functionality (if present) would generate only step functions. This seems to indicate 
that reconstruction of e(t) would require an infinitely large sampling rate, such that 
all intermediate values can be generated. 

However, there could be some kind of smart interpolation computing values in 
between the sampling instances from the values at sampling instances. And indeed, 
sampling theory [440] tells us that a corresponding time-continuous signal z(t) can 
be constructed from the sequence y(t) of analog values. 

Let $\{t_s\}$, $s = \ldots, -1, 0, 1, 2, \ldots$ be the time points at which we sample our input
signal. Let us assume a constant sampling rate of $f_s = \frac{1}{T_s}$ ($\forall s: T_s = t_{s+1} - t_s$).
Then, sampling theory tells us that we can approximate e(t) from y(t) as follows:

$$z(t) = \sum_{s=-\infty}^{\infty} y(t_s) \cdot \frac{\sin\left(\frac{\pi}{T_s}(t - t_s)\right)}{\frac{\pi}{T_s}(t - t_s)} \tag{3.29}$$


This equation is known as the Shannon-Whittaker interpolation. $y(t_s)$ is the
contribution of signal y at sampling instance $t_s$. The influence of this
contribution on z(t) decreases the further t is away from $t_s$. The decrease
follows a weighting factor, also known as the sinc function

$$\mathrm{sinc}(t - t_s) = \frac{\sin\left(\frac{\pi}{T_s}(t - t_s)\right)}{\frac{\pi}{T_s}(t - t_s)} \tag{3.30}$$

which decreases non-monotonically as a function of $|t - t_s|$. This weighting factor
is used to compute values in between the sampling instances. Figure 3.53 shows the
weighting factor for the case $T_s = 1$.
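
For individual points in time, Eqs. (3.29) and (3.30) can be evaluated numerically. The sketch below is an approximation under stated assumptions: the infinite sum is truncated to the available samples, the samples are taken at times 0, $T_s$, 2 $T_s$, ..., and the test signal (a sine wave well below half the sampling rate) as well as the function names are chosen only for illustration.

```python
import math


def sinc(x, t_s=1.0):
    """Weighting factor of Eq. (3.30) for x = t - t_s."""
    if x == 0:
        return 1.0  # limit of sin(u)/u for u -> 0
    u = math.pi * x / t_s
    return math.sin(u) / u


def reconstruct(samples, t, t_s=1.0):
    """Finite-sum version of Eq. (3.29).

    samples[s] is the value y(t_s) taken at time s * t_s.
    """
    return sum(y * sinc(t - s * t_s, t_s) for s, y in enumerate(samples))


def signal(t):
    """Hypothetical input: sine wave with frequency 0.1 (sampling rate 1)."""
    return math.sin(2 * math.pi * 0.1 * t)


samples = [signal(s) for s in range(200)]
for t in (100.0, 100.25, 100.5):
    print(t, signal(t), reconstruct(samples, t))
```

At a sampling instance, only the corresponding sample contributes. In between, many samples contribute, and the truncated sum comes close to the original signal as long as t is far away from both ends of the sample sequence.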

Using the sinc function, we can compute the terms of the sum in Eq. (3.29).
Figures 3.54 and 3.55 show the resulting terms if e(t) = e3(t) and processing
performs the identity function (x(t) = w(t)).

At each of the sampling instances $t_s$ (integer times in our case), $z(t_s)$ is computed
just from the corresponding value $y(t_s)$, since the sinc function is zero in this
case for all other sampled values. In between the sampling instances, all of the
adjacent discrete values contribute to the resulting value of z(t). Figure 3.56 shows
the resulting z(t) if e(t) = e3(t) and processing performs the identity function
(x(t) = w(t)).

The figure includes signals e3(t) (blue), y’(t) (red), and z(t) (magenta). z(t) is
computed by summing up the contributions of all sampling instances shown in the
diagrams in Figs. 3.54 and 3.55. e3(t) and z(t) are very similar.

How close could we get to the original input signal by implementing Eq. (3.29)?
Sampling theory tells us (see, e.g., [440]) that Eq. (3.29) reconstructs e(t)
exactly if the sampling criterion (Eq. (3.8)) is met. Therefore, let us see
how we can implement Eq. (3.29).

How do we compute Eq. (3.29) in an electronic system? We cannot compute
this equation in the discrete time domain using a digital signal processor,
since this computation has to generate a time-continuous signal. Computing such a
complex equation with analog circuits seems to be difficult at first sight.

Fortunately, the required computation is a so-called folding operation (also
known as convolution) between signal y(t) and the sinc function. According to the
classical theory of Fourier transforms, a folding operation in the time domain is
equivalent to a multiplication with a frequency-dependent filter function in the
frequency domain. This filter function is the Fourier transform of the
corresponding function in the time domain.

Fig. 3.57 Converting signal e(t) from the analog time/value domain to the digital domain and back


Therefore, Eq. (3.29) can be computed with some appropriate filter. Figure 3.57 
shows the corresponding placement of the filter. 

Which frequency-dependent filter function is the Fourier transform of the sinc
function? Computing the Fourier transform of the sinc function yields a low-pass
filter function [440]. So, “all” we must do to compute Eq. (3.29) is to pass signal
y(t) through a low-pass filter, filtering frequencies as shown for the ideal filter
in Fig. 3.58. The representation of function y(t) as a sum of sine waves would
require very high-frequency components, making such a filtering non-redundant,
even though we have already assumed an anti-aliasing filter to be present at the
input.

Unfortunately, ideal low-pass filters do not exist. We must live with compromises 
and design filters approximating the low-pass filters. Actually, we must live with 
several imperfections preventing a precise reconstruction of the input signals: 


• Ideal low-pass filters cannot be designed. Therefore, we must use approximations
of such filters. Designing good compromises is an art (performed extensively,
e.g., for audio equipment).

• Similarly, we cannot completely remove input frequencies beyond the Nyquist
frequency.

• The impact of value quantization is visible in Fig. 3.56. Due to value
quantization, e3(t) is sometimes different from z(t). Quantization noise, as introduced by
ADCs, cannot be removed during output generation. Signal w(t) from the output
of the ADC will remain distorted by the quantization noise. However, this effect
does not affect the signal h(t) from the output of sample-and-hold circuits.

• Equation (3.29) is based on an infinite sum, involving also values at future
instances in time. In practice, we can delay signals by some finite amount to know
a finite number of “future” samples. Infinite delays are impossible. In Fig. 3.56,
we did not consider contributions of sampling instances outside the diagram.


The functionality provided by low-pass filters demonstrates the power of analog 
circuits: there would be no way of implementing the behavior of analog filters in the 
digital domain, due to the inherent restriction to discretized time and values. 

Many authors have contributed to sampling theory. Therefore, many names can
be associated with the sampling theorem. Contributors include Shannon, Whittaker,
Kotelnikov, Nyquist, Küpfmüller, and others. Hence, the fact that the original
signal can be reconstructed should simply be called the sampling theorem, since
there is no way of attaching all names of relevant contributors to the theorem.

3.6.3 Pulse-Width Modulation


In practice, the presented generation of analog signals has a number of
disadvantages:


• DACs using an array of resistors are difficult to build. The precision of the
resistors must be excellent. The deviation of the resistor handling the most
significant bit from its nominal value must be less than the overall resolution of
the converter. For example, this means that, for a 14 bit converter, the deviation
of the real resistance from its nominal value must be in the order of 0.01%. This
precision is difficult to achieve in practice, in particular over the full temperature
range. If this precision is not achieved, the converter is not linear, possibly not
even monotonic.

• In order to generate sufficient power for motors, lamps, loudspeakers, etc.,
analog outputs would need to be amplified in a power amplifier. Analog power
amplifiers, such as so-called class A power amplifiers, are very power-inefficient,
since they contain an always conducting path between the two rails of the power
supply. This path results in a constant power consumption, irrespective of the
actual output signal. For very small output signals, the ratio between the actually
used power and the consumed power is therefore very small. As a result, the
efficiency of audio power amplifiers for low-volume audio would be very poor.

• It is not easy to integrate analog circuitry on digital micro-controller chips.
Adding external analog active components increases costs substantially.
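
The figure quoted for the 14 bit converter in the first item can be checked with a short estimate. As a simplifying assumption (the text states only the rule, not a derivation), the relative deviation of the resistor for the most significant bit is required to stay below the relative resolution 2^{-n} of an n bit converter:

```latex
\[
  \left|\frac{\Delta R}{R}\right| < 2^{-n},
  \qquad
  n = 14:\quad 2^{-14} = \frac{1}{16384} \approx 6.1 \cdot 10^{-5} \approx 0.006\,\%
\]
```

This is of the order of 0.01%, in line with the value stated above.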


Therefore, pulse-width modulation (PWM) is very popular. With PWM, we are 
using a digital output and generate a digital signal whose duty cycle corresponds to 
the value to be converted. Figure 3.59 shows digital signals with duty cycles of 25% 
and 75%. Such signals can be represented by Fourier series like in Eq. (3.1). For 
applications of PWM, we try to eliminate effects of higher-frequency components. 

PWM signals can be generated by comparing a counter against a value stored 
in a programmable register (see Fig. 3.60). A high voltage is output whenever the 
value in the counter exceeds the value in the register. Otherwise, a voltage close to 
zero is generated. The clock signal of the counter must be programmable to select 
the basic frequency of the PWM signals. In our schematic, we have assumed that 
the PWM frequency is identical for all PWM outputs. Registers must be loaded with 
the values to be converted, typically at the sampling rate of the analog signals. 
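
The comparison of counter and register can be mimicked in software. The following Python lines are a minimal sketch, assuming a free-running 8 bit counter and the output convention described above (high whenever the counter exceeds the register); the function names and the register values are hypothetical.

```python
def pwm_period(register_value, counter_bits=8):
    """One PWM period: output is 1 whenever the counter exceeds the register."""
    return [1 if counter > register_value else 0
            for counter in range(2 ** counter_bits)]

def duty_cycle(waveform):
    """Fraction of the period during which the output is high."""
    return sum(waveform) / len(waveform)

quarter = duty_cycle(pwm_period(register_value=191))        # counter values 192..255 are high
three_quarters = duty_cycle(pwm_period(register_value=63))  # counter values 64..255 are high
```

With this convention, a larger register value results in a smaller duty cycle. Real PWM hardware frequently allows selecting the polarity, so this detail should be seen as an assumption of the sketch.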

The effort required for filtering higher-frequency components depends upon the
application. For driving a motor, the averaging takes place in the motor, due to the
mass of the moving parts in the motor and possibly also due to the self-inductance
of the motor. Hence, no external components are needed (see Fig. 3.60). For lamps,
the averaging takes place in the human eye, as long as the frequencies are not too
low. It may also be okay to drive simple buzzers directly. In other cases, filtering
out higher-frequency components may be needed. For example, electromagnetic
radiation caused by higher-frequency components may be unacceptable, or audio
applications may demand that high-frequency components are filtered out. In
Fig. 3.60, two capacitors and one inductor have been used to filter out
high-frequency components for the loudspeakers. In our example, we are showing
four PWM outputs. Having several PWM outputs is a common situation. For example,
Atmel 32 bit AVR micro-controllers in the AT32UC3A Series have seven PWM outputs
[27]. In practice, there are many options for the detailed behavior of PWM hardware.
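
The averaging effect of a filter can be illustrated with a small simulation. This is a sketch under simplifying assumptions: instead of the filter with two capacitors and one inductor of Fig. 3.60, a first-order low-pass (as formed by one resistor and one capacitor) is used, it is discretized with a simple forward-Euler step, and all numbers (steps per period, time constant, duty cycle) are hypothetical.

```python
def low_pass(signal, dt, tau):
    """First-order low-pass filter with time constant tau (forward Euler)."""
    v, out = 0.0, []
    for u in signal:
        v += (u - v) * dt / tau
        out.append(v)
    return out

period = 100                                  # time steps per PWM period
high_steps = 75                               # duty cycle of 75%
one_period = [1.0] * high_steps + [0.0] * (period - high_steps)
pwm = one_period * 200                        # 200 PWM periods

filtered = low_pass(pwm, dt=1.0, tau=20 * period)   # time constant >> PWM period

last = filtered[-period:]                     # last period, after settling
average = sum(last) / period
ripple = max(last) - min(last)
```

After settling, the filtered signal stays close to 75% of the high level, with a small remaining ripple. A shorter time constant would react faster to changes of the duty cycle but would leave more ripple.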

The choice of the basic frequency (the reciprocal of the period) of the PWM
signal and the filter is a matter of compromises. The basic frequency has to be
higher than the highest-frequency component of the analog signal to be converted.
Higher frequencies simplify the design of the filter if any is present. Selecting
too high a frequency results in more electromagnetic radiation and in unnecessary
energy consumption, since switching will consume energy. Compromises typically use a
basic PWM frequency that is larger than the highest frequency of the analog signal
by a factor between 2 and 10.


3.6.4 Actuators 


There is a huge number of actuators [151]. Actuators range from large ones that are
able to move tons of weight to tiny ones with dimensions in the µm range, like the
one shown in Fig. 3.61.

Fig. 3.61 Detail of a rotary stepper micromotor: top: stationary part; lower left: rotary part. The micromotor uses three-phase electrostatic power [478]. © Sarajlic et al. (2010)

Figure 3.61 shows a tiny motor manufactured with microsystem technology. The
dimensions are in the µm range. The rotating center is controlled by electrostatic
forces.

As an example, we mention only a special kind of actuator which will become
more important in the future: microsystem technology enables the fabrication of
tiny actuators, which can be put into the human body, for example. Using such tiny
actuators, the amount of drugs fed into the body can be adapted to the actual need.
This allows a much better medication than needle-based injections.

Actuators are important for the Internet of Things. It is impossible to provide a
complete overview of actuators.


3.7 Electrical Energy 


General constraints and objectives for the design of embedded and cyber-physical 
systems (see pp. 8—16 and Table 1.2) have to be obeyed for hardware design. Among 
the different objectives, we will focus on energy efficiency. Reasons for caring about 
the energy efficiency were listed in Table 1.1 on p. 13. 


3.7.1 Energy Sources 


For plugged devices (i.e., for those connected to the power grid), energy is easily
available. For all others, energy must be made available via other techniques. In
particular, this applies to sensor networks used in IoT systems where energy can
be a very scarce resource. Batteries store energy in the form of chemical energy.
Their main limitation is that they must be carried to the location where the energy
is required. If we would like to avoid this limitation, we have to use energy
harvesting, also called energy scavenging. A large number of techniques for
energy harvesting is available [570, 577], but the amount of energy is typically much
more limited:


• Photovoltaics allows the conversion of light into electrical energy. The
conversion is usually based on the photovoltaic effect of semiconductors. Panels of
photovoltaic material are in widespread use. Examples can be seen in Fig. 3.62.

• The piezoelectric effect can be used to convert mechanical strain into electrical
energy. Piezoelectric lighters exploit this effect.

• Thermoelectric generators (TEGs) allow turning temperature gradients into
electrical energy. They can be used even on the human body.

• Kinetic energy can be turned into electrical energy. This is exploited, for
example, for some watches. Also, wind energy falls into this category.

• Ambient electromagnetic radiation can be turned into electrical energy as well.

• There are many other physical effects allowing us to convert other forms of
energy into electrical energy.

Fig. 3.62 Photovoltaic material: left, panel; right, solar-powered watch


3.7.2 Energy Storage 


For many applications of embedded systems, power sources are not guaranteed to 
provide power whenever it is needed. However, we may be able to store electrical 
energy. Methods for storing electrical energy include the following: 


1. Non-rechargeable batteries can be used only once and will not be considered. 

2. Capacitors are a very convenient means of storing electrical energy. Their
advantages include a potentially fast charging process, very high output currents,
close to 100% efficiency, low leakage currents (for high-quality capacitors), and
a large number of charge/discharge cycles. The limited amount of energy that
can be stored is their main disadvantage.

3. Rechargeable batteries allow storing and using electrical energy, very much 
like capacitors. Storing electrical energy is based on certain chemical processes, 
and using this energy is based on reversing these chemical processes. 


Due to their importance for embedded systems, we will discuss rechargeable 
batteries. If we want to include sources of electrical energy in our system model, we 
will need models of rechargeable batteries. Various models can be used. They differ 
in the amount of details that are included, and there is not a single model that fits all 
needs [467]. The following models are popular: 


• Chemical and physical models: They describe the chemical and/or physical
operation of the battery in detail. Such models may include partial differential
equations, including many parameters. These models are beneficial for battery
manufacturers but typically too complex for designers of embedded systems
(who will typically not know the parameters).

• Simple empirical models: Such models are based on simple equations for which
some parameter fitting has been performed. Peukert’s law [451] is a frequently
cited empirical model. According to this law, the lifetime of a battery is

\[
  \text{lifetime} = \frac{C}{I^{\alpha}} \tag{3.31}
\]

where α > 1 is the result of some empirical fitting process. Peukert’s law reflects
the fact that higher currents will typically lead to an effective decrease of the
battery capacity. Other details of battery behavior are not included in this model.

• Abstract models: These provide more details than the very simple empirical
models, but do not refer to chemical processes. We would like to present two
such models:


– The model proposed by Chen and Rincón-Mora [94]. The model is an electrical
model, as shown in Fig. 3.63. According to this model, a charging current
I_Bat controls a current source in the left part of the schematic. The current
generated by the current source is equal to the charging current entering on the
right. This current will charge the capacitor C_Capacity. The amount of charge
on the capacitor is called state of charge (SoC). The state of charge is reflected
by the voltage V_SOC on the capacitor, since the charge on the capacitor can
be computed as Q = C_Capacity * V_SOC. Resistor R_Self-Discharge models the
self-discharge (leakage) of this capacitor which happens even when no current
is drawn at the terminal pins of the battery.

Let us consider the voltage which is available at the battery terminals when
the current through these terminals is zero. The voltage at the battery terminals
will typically depend non-linearly on V_SOC. This dependency can be modeled
by a non-linear function V_OC(V_SOC), representing the open terminal output
voltage of the battery. This voltage decreases when the battery provides some
current. For a constant discharging current, R_Series + R_Transient_S models
the corresponding voltage drop. For short current spikes, the decrease is
determined by the value of R_Series only, since C_T will act as a buffer. When
the current consumption increases, time constant R_Transient_S * C_T determines
the speed for the transition from only R_Series causing the voltage drop to
R_Series + R_Transient_S causing the voltage drop. The original proposal by
Chen et al. includes a second resistor/capacitor pair in order to model transient
output voltage behavior more precisely. Overall, this model captures the
impact of high output currents on the voltage, the non-linear dependency of
the output voltage, and self-discharge reasonably well. Simpler versions of
this model exist, i.e., ones that do not model all three effects.

Fig. 3.63 Battery model according to Chen et al. (simplified)

– Actual batteries exhibit the so-called charge recovery effect: whenever the
discharge process of batteries is paused for some time interval, the battery
recovers, i.e., more charge becomes available, and the voltage is typically also
increased. This effect is not considered in Chen’s model. However, it is the
focus of the so-called kinetic battery model (KiBaM) of Manwell et al. [364].
The name reflects the analogy upon which this model is based. The model
assumes two different bins of charge, as shown in Fig. 3.64. The right bin
contains the charge y1 which is immediately available. The left bin contains
charge y2 which exists in the battery but which needs to flow into the right
bin to become available. An interval of heavy usage of the battery may almost
empty the right bin. It will then take some time for charge to become available
again. The speed of the recovery process is determined by parameter k, the
width of the pipe connecting the two bins. The details of the model (like the
amount of charge flowing) reflect the physical situation of the bins. This model
describes the charge recovery process with some reasonable precision but fails
to describe transients and self-discharge as captured in Chen’s model. The
kinetic model has an impact on how embedded systems should be used. For
example, it has been demonstrated that it is beneficial to plan for intervals
during which wireless transmission is turned off [144].


Overall, the two models demonstrate nicely that models must be selected to 
reflect the effects that should be taken into account. 


• There may be mixed models which are partially based on abstract models and
partially on chemical and physical models.
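
Returning to the simple empirical models in the list above, the effect of the exponent in Peukert’s law (Eq. (3.31)) can be made concrete with a few lines of Python. The values of C and α below are hypothetical and not fitted to any real battery; the sketch only shows that, for α > 1, doubling the current more than halves the lifetime.

```python
def peukert_lifetime(capacity, current, alpha):
    """Eq. (3.31): lifetime = C / I**alpha (units follow the fitted constant C)."""
    return capacity / current ** alpha

C = 2.0        # hypothetical fitted constant
alpha = 1.2    # hypothetical fitted exponent, alpha > 1

t_low = peukert_lifetime(C, 0.5, alpha)    # lifetime at a current of 0.5
t_high = peukert_lifetime(C, 1.0, alpha)   # lifetime at twice that current
ratio = t_low / t_high                     # equals 2**alpha, i.e., more than 2

charge_low = 0.5 * t_low                   # charge delivered at the low current
charge_high = 1.0 * t_high                 # charge delivered at the high current
```

The delivered charge (current times lifetime) is smaller for the higher current, which is the effective decrease of the battery capacity mentioned in the text.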

Fig. 3.64 Kinetic battery model
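
The charge recovery effect of the kinetic battery model can be reproduced with a small simulation. The text above does not state the model equations, so the sketch below assumes a commonly used formulation: the bin with the available charge y1 holds a fraction c of the total capacity, the fill levels are h1 = y1/c and h2 = y2/(1-c), and charge flows between the bins at a rate k(h2 - h1). All parameter values, units, and function names are hypothetical.

```python
def kibam_step(y1, y2, current, dt, c=0.5, k=0.05):
    """One forward-Euler step of the two-bin model.
    y1: available charge, y2: bound charge, c: capacity fraction of the
    available bin, k: rate constant (the 'width of the pipe')."""
    h1 = y1 / c              # fill level of the bin with available charge
    h2 = y2 / (1.0 - c)      # fill level of the bin with bound charge
    flow = k * (h2 - h1)     # charge per time flowing into the available bin
    y1 += (flow - current) * dt
    y2 -= flow * dt
    return y1, y2

def simulate(profile, y1=50.0, y2=50.0, dt=0.1):
    """profile: list of (current, duration) phases; returns (y1, y2) per phase."""
    states = []
    for current, duration in profile:
        for _ in range(round(duration / dt)):
            y1, y2 = kibam_step(y1, y2, current, dt)
        states.append((y1, y2))
    return states

# Heavy usage for 20 time units, followed by a pause of 20 time units.
after_load, after_pause = simulate([(2.0, 20.0), (0.0, 20.0)])
```

During the pause, the total charge stays the same, but the available charge y1 grows again because charge flows in from the other bin. This is the behavior that makes planned idle intervals attractive.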


3.7.3 Energy Efficiency of Hardware Components 


We will continue our discussion of energy efficiency by comparing the energy 
efficiency for the different technologies which we have at our disposal. Hardware 
components discussed in this chapter are quite different as far as their energy 
efficiency is concerned. A comparison between these technologies and changes over 
time (corresponding to a certain fabrication technology) can be seen in Fig. 3.65.¹⁹
The figure reflects the conflict between efficiency and flexibility of currently 
available hardware technologies. 

The diagram shows the energy efficiency GOP/J in terms of number of operations 
per unit of energy of various target technologies as a function of time and the target 
technology. In this context, operations could be 32 bit additions. Obviously, the 
number of operations per joule is increasing as technology advances to smaller 
and smaller feature sizes of integrated circuits. However, for any given technology, 
the number of operations per joule is largest for hardwired application-specific 
integrated circuits (ASICs). For reconfigurable logic usually coming in the form 
of field programmable gate arrays (FPGAs; see p. 165), this value is about one 
order of magnitude less. For programmable processors, it is even lower. However, 
processors offer the largest amount of flexibility, resulting from the flexibility of 
software. There is also some flexibility for reconfigurable logic, but it is limited to 
the size of applications that can be mapped to such logic. For hardwired designs, 
there is no flexibility. The trade-off between flexibility and efficiency also applies to 
processors: for processors optimized for an application domain, such as processors 
optimized for digital signal processing (DSP), power-efficiency values approach 
those of reconfigurable logic. For general standard microprocessors, the values for 
this figure of merit are the worst. This can be seen from Fig. 3.65, comprising
values for microprocessors such as x86-like processors (see “MPU” entries), RISC
processors, and the Cell processor designed by IBM, Toshiba, and Sony.

Figure 3.65 does not identify exactly the applications which are compared, and it 
does not allow us to study the type of application mapping that has been performed. 


¹⁹ The figure approximates information provided by H. De Man [363] and is based on information
provided by Philips.

Fig. 3.65 Hardware efficiency (© De Man and Philips)


More detailed and more recent comparisons have been made, enabling us to study 
the assumptions and the approach of these comparisons in a more comprehensive 
manner. A survey of comparisons involving GPUs has been published by Mittal 
et al. [398]. The survey includes a list of 28 publications for which GPUs have 
been found to be more energy-efficient than CPUs and 2 publications for which 
the reverse was true. Also, the survey comprises a list of 26 publications for which 
FPGAs have been found to be more energy-efficient than GPUs and 1 for which 
the reverse was true. For example, Hamada et al. [200] found for a gravitational
n-body simulation that the number of operations per watt was higher by a factor of 15
for FPGAs than for GPUs. For a comparison against CPUs, the factor was 34. The
exact factors certainly depend on the application, but as a rule of thumb, we can state 
the following: If we aim at top power- and energy-efficient designs, we should use 
ASICs. If we cannot afford ASICs, we should go for FPGAs. If FPGAs are also not 
an option, we should select GPUs. Also, we have already seen that heterogeneous 
processors are in general more energy-efficient than homogeneous processors. More 
detailed information can be computed for particular application areas. 

The Case of Mobile Phones


Among the different applications of embedded systems (see pp. 4—8), we are now 
looking at telecommunication and smart phones. For smart phones, computational 
requirements are increasing at a rapid rate, especially for multimedia applications. 
De Man and Philips estimated that advanced multimedia applications need about
10-100 billion operations per second. Figure 3.65 demonstrates that advanced
hardware technologies provided us more or less with this number of operations per joule
(= Ws) in 2007. This means that the most power-efficient platform technologies
hardly provided the efficiency which was needed. Standard processors (entries
for MPU and RISC) were hopelessly inefficient. It also meant that all sources of
efficiency improvements needed to be exploited. More recently, the power efficiency 
has been improved. However, all such improvements are typically compensated by 
trends to provide a higher quality, e.g., by an increase of the resolution of still and 
moving images as well as a higher bandwidth for communication. 
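
The relation between the required number of operations per second and the efficiency in operations per joule can be written down explicitly. The numbers in this worked example are assumptions chosen from within the range quoted above (50 billion operations per second and 50 billion operations per joule); they are not measured values.

```latex
\[
  P = \frac{\text{operations per second}}{\text{operations per joule}},
  \qquad
  \frac{50 \cdot 10^{9}\ \mathrm{operations/s}}{50 \cdot 10^{9}\ \mathrm{operations/J}} = 1\ \mathrm{W}
\]
```

For a hypothetical platform that is less efficient by a factor of 50 (10^9 operations per joule), the same workload would require 50 W, which illustrates why less efficient platform technologies are not an option for battery-operated devices.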

A detailed analysis of the power consumption has been published by Berkel [553] 
and by Carroll et al. [84]. A more recent analysis including LTE mobile phones has 
been published by Dusza et al. [144]. A power consumption of up to around 4 watts 
has been observed. The display itself caused a consumption of up to around 1 watt, 
depending on the display brightness. 

Improving battery technology would allow us to consume power over longer 
periods, but the thermal limitation prevents us from going significantly beyond 
the current consumption in the near future. Due to thermal issues, it has become 
standard to design mobile phones with temperature sensors and to throttle devices 
in case of overheating. Of course, a larger power consumption would be feasible for 
larger devices. Nevertheless, environmental concerns also result in the need to keep 
the power consumption low. 

Technology forecasts have been published as the so-called International Technology
Roadmap for Semiconductors (ITRS). In the ITRS edition of 2013 [261], it is explicitly
stated that mobile phones are driving technological development: “System
integration has shifted from a computational, PC-centric approach to a highly 
diversified mobile communication approach. The heterogeneous integration of 
multiple technologies in a limited space (e.g., GPS, phone, tablet, mobile phones, 
etc.) has truly revolutionized the semiconductor industry by shifting the main goal 
of any design from a performance driven approach to a reduced power driven 
approach. In few words, in the past performance was the one and only goal; today 
minimization of power consumption drives IC design.” 


Sensor Networks 


Sensor networks used for the Internet of Things are another special case. For sensor
networks, there may be much less energy available than for mobile phones.
Hence, energy efficiency is of utmost importance, including of course
energy-efficient communication [543].

3.8 Secure Hardware


The general requirements for embedded systems can often include security (see 
p. 9). In particular, security is important for the Internet of Things. If security is a 
major concern, special secure hardware may need to be developed. Security may 
need to be guaranteed for communication and for storage [309]. Security has to be 
provided despite possible attacks and countermeasures must be designed. Attacks 
can be partitioned into the following [300]: 


• Software attacks are based on the execution of software. The deployment of
software Trojans is an example of such an attack. Also, software defects can
be exploited. Buffer overflows are a frequent cause of security hazards.
Side-channel attacks try to exploit additional sources of information complementing
the specified interfaces. Side-channel attacks based on software execution are
difficult, but not infeasible. For example, it may be possible to exploit execution
time information.²⁰ Security-relevant algorithms should be designed such that
their execution time does not depend on data values. This requirement also affects
the implementation of computer arithmetic: instructions should not have
data-dependent execution times.

• Attacks which require physical access; these can be classified into the
following:


— Physical attacks try to open a side channel by physically tampering with the 
system. For example, silicon chips can be opened and analyzed. The first step 
in this procedure is de-packaging (removing the plastic covering the silicon). 
Next, micro-probing or optical analysis can be used. Such attacks are difficult, 
but they reveal many details of the chip. 

— Power analysis is another class of attacks. Power analysis techniques include 
simple power analysis (SPA) and differential power analysis (DPA). In some 
cases, SPA may be sufficient to compute encryption keys. In other cases, 
advanced statistical methods may be needed to directly compute keys from 
small statistical fluctuations of measured currents. 

— Analysis of electromagnetic radiation is another class of side-channel 
attacks. 
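
The requirement stated above for software attacks, namely that execution time should not depend on data values, can be illustrated by comparing two ways of checking a secret value. This is a minimal sketch: `leaky_equal` and `constant_time_equal` are hypothetical helper names, and the byte strings are made up. The second function relies on `hmac.compare_digest` from the Python standard library, which is intended for comparisons that should not leak timing information about the contents.

```python
import hmac

def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: the number of loop iterations depends on the data."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False      # returns sooner the earlier the first mismatch is
    return True

def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Comparison designed not to reveal the position of the first mismatch."""
    return hmac.compare_digest(a, b)

secret = b"hypothetical-key"
guess = b"hypothetical-kez"

same_result = leaky_equal(secret, guess) == constant_time_equal(secret, guess)
```

Both functions return the same results, but an attacker measuring the execution time of the first one could, in principle, find out how many leading bytes of a guess are correct.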


Different classes of people might try these attacks, and different classes of 
people may have an interest in blocking these attacks. The attacker may actually 
be the user of an embedded device trying to obtain unauthorized network access or 
unauthorized access to protected media such as music. 

We can distinguish between the following countermeasures: 


²⁰ Side-channel attacks based on timing information have been published under the names Spectre
and Meltdown. They apply to modern processors using speculative execution; see
https://en.wikipedia.org/wiki/Spectre_(security_vulnerability).

• A security-aware software development process is required as a shield against
software attacks.

• Tamper-resistant devices include special mechanisms for physical protection
(shielding, or sensors to detect tampering with the modules).

• Devices can be designed such that processed data patterns have very little impact
on the power consumption. This requires special devices which are typically not
used in complex chips.

• Logical security, typically provided by cryptographic methods: encryption can
be based on either symmetric or asymmetric ciphers.


— For symmetric ciphers, sender and receiver are using the same secret key 
to encrypt and decrypt messages. DES, 3DES, and AES are examples of 
symmetric ciphers. 

— For asymmetric ciphers, messages are encrypted with a public key and 
decrypted with a private key. RSA and Diffie-Hellman are examples of 
asymmetric ciphers. 

— Also, hash codes can be added to messages, allowing the detection of message 
modifications. MD5 and SHA are examples of hashing algorithms. 


Due to the performance gap, some processors may support encryption and 
decryption with dedicated instructions. Also, specialized solutions such as 
ARM’s TrustZone computing exist. “At the heart of the TrustZone approach is 
the concept of secure and non-secure worlds that are hardware separated, with 
non-secure software blocked from accessing secure resources directly. Within the 
processor, software either resides in the secure world or the non-secure world; a 
switch between these two worlds is accomplished via software referred to as the 
secure monitor (Cortex-A) or by the core logic (Cortex-M). This concept of secure 
(trusted) and non-secure (non-trusted) worlds extends beyond the processor 
to encompass memory, software, bus transactions, interrupts, and peripherals 
within an SoC” (see https://www.arm.com/products/security-on-arm/trustzone). 

The Kalray MPPA2®-256 multi-core processor chip contains as many as
128 specialized crypto co-processors connected to a matrix of 288 “regular”
cores (see http://www.kalrayinc.com/kalray/products/). Cores are 64 bit VLIW
processors.
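
The use of hash codes for detecting message modifications, mentioned in the list above, can be sketched with the Python standard library. The sketch uses SHA-256 as the hashing algorithm. In addition, it shows a keyed hash (HMAC), which is not discussed in the text and is included here as an assumption about how a secret key shared by sender and receiver could protect the hash code itself. The key and the messages are hypothetical.

```python
import hashlib
import hmac

def digest(message: bytes) -> str:
    """Hash code (here SHA-256) that can be added to a message."""
    return hashlib.sha256(message).hexdigest()

def tag(key: bytes, message: bytes) -> bytes:
    """Keyed hash (HMAC-SHA-256) computed with a shared secret key."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, received_tag: bytes) -> bool:
    """Recompute the keyed hash and compare it with the received one."""
    return hmac.compare_digest(tag(key, message), received_tag)

key = b"hypothetical shared key"
msg = b"open valve 3"
sent_tag = tag(key, msg)

accepted = verify(key, msg, sent_tag)                 # unmodified message
rejected = not verify(key, b"open valve 4", sent_tag) # modified message
```

A plain hash code detects modifications only as long as an attacker cannot replace the hash code as well, which is the motivation for including a key in this sketch.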


The following challenges exist for the design of countermeasures [300]: 


1. Performance gap: Due to the limited performance of embedded systems, 
advanced encryption techniques may be too slow, in particular if high data rates 
have to be processed. 

2. Battery gap: Advanced encryption techniques require a significant amount of 
energy. This energy may be unavailable in a portable system. Smart cards are a 
special case of hardware that must run using a very small amount of energy. 

3. Lack of flexibility: Frequently, many different security protocols are required 
within one system, and these protocols may have to be updated from time to 
time. This hinders using special hardware accelerators for encryption. 

4. Tamper resistance: Mechanisms against malicious attacks need to be built in.
Their design is far from trivial. For example, it may be difficult if not impossible
to guarantee that the current consumption is independent of the cryptographic
keys that are processed.

5. Assurance gap: The verification of security requires extra efforts during the 
design. 

6. Cost: Higher security levels increase the cost of the system. 


Ravi et al. have analyzed these challenges in detail for a Secure Sockets Layer (SSL) 
protocol [300]. 

More information on secure hardware is available, for example, in a book by 
Gebotys [180] and in proceedings of a workshop series dedicated to this topic (see 
[183] for the most recent edition). 


3.9 Problems 


We suggest solving the following problems either at home or during a flipped 
classroom session: 


3.1 It is suggested that locally available small robots are used to demonstrate 
hardware in the loop, corresponding to Fig. 3.2. The robots should include sensors 
and actuators. Robots should run a program implementing a control loop. For 
example, an optical sensor could be used to let a robot follow a black line on the 
ground. The details of this assignment depend on the availability of robots. 


3.2 Define the term “signal”! 


3.3 Which circuit do we need for the transition from continuous time to discrete 
time? 


3.4 What does the sampling theorem tell us? 


3.5 Assume that we have an input signal x consisting of the sum of sine waves 
of 1.75kHz and 2kHz. We are sampling x at a rate of 3kHz. Will we be able 
to reconstruct the original signal after discretization of time? Please explain your 
result! 


3.6 Discretization of values is based on ADCs. Develop the schematic of a flash- 
based ADC for positive and negative input voltages! The output should be encoded 
as 3 bit two’s complement numbers, allowing to distinguish between eight different 
voltage intervals. 


3.7 Suppose that we are working with a successive approximation-based 4 bit
ADC. The input voltage range extends from Vmin = 1 V (= "0000") to Vmax = 4.75 V
(= "1111"). Which steps are used to convert voltages of 2.25 V, 3.75 V, and 1.8 V?
Draw a diagram similar to Fig. 3.12 which depicts the successive approximation to
these voltages!
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Table 3.2 Complexity of ADCs 


Flash-based converter Successive approximation converter 
Time complexity 
Space complexity 


3.8 Compare the complexity of flash-based and successive approximation-based
ADCs. Assume that you would like to distinguish between n different voltage
intervals. Enter the complexity into Table 3.2, using the O-notation.


3.9 Suppose a sine wave is used as an input signal to the converter designed in 
Problem 3.6. Depict the quantization noise signal for this case! 


3.10 Create a list of features of DSP processors! 


3.11 Which components do FPGAs comprise? Which of these are used to imple- 
ment Boolean functions? How are FPGAs configured? Are FPGAs energy-efficient? 
Which kind of applications are FPGAs good for? 


3.12 What is the key idea of VLIW processors? 


3.13 What is a “single-ISA heterogeneous multi-core architecture”? Which advan-
tages do you see for such an architecture?


3.14 Explain the terms “GPU” and “MPSoC”! 


3.15 Some FPGAs support an implementation of all Boolean functions of six 
variables. How many such functions exist? We ignore that some functions differ 
only by a renaming of variables. 


3.16 In the context of memories, we are sometimes saying “small is beautiful.” 
What could be the reason for this? 


3.17 Some levels of the memory hierarchy may be hidden from the application pro- 
grammer. Why should such a programmer nevertheless care about the architecture 
of such levels? 


3.18 What is a “scratchpad memory” (SPM)? How can we ensure that some 
memory object is stored in the SPM? 


3.19 Develop the following FlexRay™ cluster: The cluster consists of the five nodes
A, B, C, D, and E. All nodes should be connected via two channels. The cluster uses a
bus topology. The nodes A, B, and C are executing a safety-critical task, and therefore
their bus requests should be guaranteed at the time of 20 macroticks. The following
is expected from you:


• Download the levi FlexRay simulator [495]. Unpack the ZIP file and install!

• Start the training module by executing the file leviFRP.jar.

• Design the described FlexRay cluster within the training module.

• Configure the communication cycle such that the nodes A, B, and C have a
guaranteed bus access within a maximal delay of 20 macroticks. The nodes D
and E should use only the dynamic segment.

• Configure the node bus requests. The node A sends a message every cycle. The
nodes B and C send a message every second cycle. The node D sends a message
of the length of 2 minislots every cycle, and the node E sends every second cycle
a message of the length of 2 minislots.

• Start the visualization and check if the bus requests of the nodes A, B, and C are
guaranteed.

• Swap the positions of nodes D and E in the dynamic segment. What is the
resulting behavior?


3.20 Develop the schematic of a 3 bit DAC! The conversion should be done for a 3 
bit vector x encoding positive numbers. Prove that the output voltage is proportional 
to the value represented by the input vector x. How would you modify the circuit if 
x represented two’s complement numbers? 


3.21 The circuit shown in Fig. B.4 in Appendix B is an amplifier, amplifying input
voltage $V_1$:


$V_{out} = g_{closed} \cdot V_1$


Compute the gain $g_{closed}$ for the circuit of Fig. B.4 as a function of $R$ and $R_1$!


3.22 How do different hardware technologies differ with respect to their energy 
efficiency? 


3.23 The computational efficiency is sometimes also measured in terms of billions 
of operations per second per watt. How is this different from the figure of merit used 
in Fig. 3.65? 


3.24 Why is it so important to optimize embedded systems? Compare different 
technologies for processing information in an embedded system with respect to their 
efficiency! 


3.25 Suppose that your mobile phone uses a lithium battery rated at 720 mAh. The 
nominal voltage of the battery is 3.7 V. Assuming a constant power consumption 
of 1 W, how long would it take to empty the battery? All secondary effects such as 
decreasing voltages should be ignored in this calculation. 


3.26 Which challenges do you see for the security of embedded systems? 


3.27 What is a “side-channel attack”? Please provide examples of side-channel
attacks!


Chapter 4
System Software


In order to cope with the complexity of applications of embedded systems, reuse 
of components is a key technique. As pointed out by Sangiovanni-Vincentelli 
[476], software and hardware components must be reused in the platform-based 
design methodology (see p. 296). These components comprise knowledge from 
earlier design efforts and constitute intellectual property (IP). Standard software 
components that can be reused include system software components such as 
embedded operating systems (OSs) and middleware. The last term denotes software 
that provides an intermediate layer between the OS and application software. This 
chapter starts with a description of general requirements for embedded operating 
systems. This includes real-time capabilities as well as adaptation techniques to 
provide just the required functionality. Mutually exclusive access to resources 
can result in priority inversion, which is a serious problem for real-time systems. 
Priority inversion can be circumvented with resource access protocols. We will 
present three such protocols: the priority inheritance, priority ceiling, and stack 
resource protocols. A separate section covers the ERIKA real-time system kernel. 
Furthermore, we will explain how Linux can be adapted to systems with tight 
resource constraints. Finally, we will provide pointers for additional reusable 
software components, like hardware abstraction layers (HALs), communication 
software, and real-time data bases. Our description of embedded operating systems 
and of middleware in this chapter is consistent with the overall design flow (see also
Fig. 4.1).

Fig. 4.1 Simplified design information flow


4.1 Embedded Operating Systems 


4.1.1 General Requirements 


Except for very simple systems, I/O, scheduling, and context switching require the
support of an operating system suited for embedded applications. Switching from
the execution of one code object to the execution of another code object is
called context switching. Context switching multiplexes processors such that each
code object seems to have its own processor. For code objects, we distinguish
between processes and threads. First of all, we define the term “process”:


Definition 4.1 (Adopted from Tanenbaum [525]) A process is an executed 
program (or a part of a program) including memory content. 


Courses on operating systems provide additional information about this term (e.g., 
in German [472]). In this chapter, we will be using this term in the sense of an entity 
within the operating system (and not in the sense of processes in SDL, VHDL, 
process networks, or semiconductor fabrication). 

For systems with virtual addressing¹, we can distinguish between different
address spaces. For such systems, we have to distinguish between executions of 
code objects within separate or within the same address spaces. If they are executed 
within separate address spaces, we will call them processes. If they are executed 
within the same address space, we will call them threads (or lightweight processes). 


Definition 4.2 A thread is an executed program using the same address space as 
other programs. 


For processes, there is some form of memory protection, since processes cannot
corrupt the memory areas of other processes. However, context switches have to
change address translation information. Hence, they come with some overhead. For
threads, this protection does not exist. In fact, threads sharing an address space will
typically communicate via shared memory. Context switching for threads is typically
faster than for processes. We do not need to distinguish between threads and
processes if there is just one address space. More information about these standard
topics in system software can be found in textbooks on operating systems, such as
the book by Tanenbaum [525]. Operating systems have to provide communication
and synchronization methods for threads and processes.

¹ See Appendix C.
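The following minimal sketch illustrates communication via shared memory between threads, together with the synchronization that this requires. It is written in Python for brevity and is not tied to any particular embedded OS; the names `shared`, `worker`, and `run` are hypothetical and chosen for illustration only.

```python
import threading

# A single object living in the one address space shared by all threads.
shared = {"value": 0}
# A synchronization method provided by the run-time system.
lock = threading.Lock()


def worker(increments: int) -> None:
    """Increment the shared counter; the lock protects the read-modify-write."""
    for _ in range(increments):
        with lock:
            shared["value"] += 1


def run(num_threads: int = 4, increments: int = 1000) -> int:
    """Start several threads working on the same memory and return the result."""
    shared["value"] = 0
    threads = [
        threading.Thread(target=worker, args=(increments,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["value"]


if __name__ == "__main__":
    print(run())
```

All threads see the same `shared` object without any copying. Separate processes would each work on their own copy, unless the OS is explicitly asked to set up a shared memory area.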

The following are essential features of embedded operating systems: 


• Due to the large variety of embedded systems, there is also a large variety
of requirements for the functionality of embedded OSs. Due to efficiency
requirements, it is not possible to work with OSs which provide the union of
all functionalities. For most applications, the OS must be small. Hence, we need
operating systems which can be flexibly tailored toward the application at hand.
Configurability is therefore one of the main characteristics of embedded OSs.
There are various techniques of implementing configurability, including:²


– Object orientation, used for a derivation of proper subclasses: for example,
we could have a general scheduler class. From this class we could derive
schedulers having particular features. However, object-oriented approaches
typically come with an additional overhead. For example, dynamic binding
of methods does create run-time overhead. Ideas for reducing this overhead
exist (see, e.g.,
https://github.com/lefticus/cppbestpractices/blob/master/08-Considering_Performance.md).
Nevertheless, remaining overhead and potential timing unpredictability may be
unacceptable for performance-critical system software.

– Aspect-oriented programming [352]: with this approach, orthogonal aspects
of software can be described independently and then can be added automatically
to all relevant parts of the program code. For example, some code for
profiling can be described in a single module. It can then be automatically
added to or dropped from all relevant parts of the source code. The CIAO
family of operating systems has been designed in this way [350].

– Conditional compilation: in this case, we are using some macro preprocessor,
and we are taking advantage of #if and #ifdef preprocessor commands.

– Advanced compile-time evaluation: configurations could be performed by
defining constant values of variables before compiling the OS. The compiler
could then propagate the knowledge of these values as much as possible.
Advanced compiler optimizations may also be useful in this context. For
example, if a particular function parameter is always constant, this parameter
can be dropped from the parameter list. Partial evaluation [275] provides a
framework for such compiler optimizations. In a sophisticated form, dynamic
data might be replaced by static data [26]. A survey of operating system
specialization was published by McNamee et al. [387].

– Linker-based removal of unused functions: at link-time, there may be more
information about used and unused functions than during earlier phases. For
example, the linker can figure out which library functions are used. Unused
library functions can be accordingly dropped and specializations can take
place [91].

² This list is sorted by the position of the technique in the development process or tool chain.


These techniques are frequently combined with a rule-based selection of files
to be included in the operating system. Tailoring the OS can be made easy
through a graphical user interface hiding the techniques employed for achieving
this configurability. For example, VxWorks [590] from Wind River is configured
via a graphical user interface.

Verification is a potential problem of systems with a large number of derived 
tailored OSs. Each and every derived OS must be tested thoroughly. Takada 
mentions this as a potential problem for eCos (an open-source RTOS; see 
http://ecos.sourceware.org and Massa [381]), comprising 100-200 configuration 
points [523]. For Linux, this problem is even larger [526]. Software product line 
engineering [456] can contribute toward solving this problem. 

• There is a large variety of peripheral devices employed in embedded systems.
Many embedded systems do not have a hard disk, a keyboard, a screen, or 
a mouse. There is effectively no device that needs to be supported by all 
variants of the OS, except maybe the system timer. Frequently, applications 
are designed to handle particular devices. In such cases, devices are not shared 
between applications, and hence there is no need to manage the devices by the 
OS. Due to the large variety of devices, it would also be difficult to provide all 
required device drivers together with the OS. Hence, it makes sense to decouple 
OS and drivers by using special processes instead of integrating their drivers into 
the kernel of the OS. Due to the limited speed of many embedded peripheral 
devices, there is also no need for an integration into the OS in order to meet 
performance requirements. This may lead to a different stack of software layers. 
For PCs, some drivers, such as disk drivers, network drivers, or audio drivers, are 
implicitly assumed to be present. They are implemented at a very low level of the 
stack. The application software and middleware are implemented on top of the 
application programming interface, which is standard for all applications. For an 
embedded OS, device drivers are implemented on top of the kernel. Applications 
and middleware may be implemented on top of appropriate drivers, not on top of 
a standardized API of the OS (see Fig. 4.2). Drivers may even be included in the 
application itself. 

Fig. 4.2 Device drivers implemented on top of (left) or below (right) the OS kernel
(layers from top to bottom, left stack: application software, middleware, device
drivers, OS kernel; right stack: application software, middleware, OS kernel,
device drivers)

• Protection mechanisms are sometimes not necessary, since embedded systems
are sometimes designed for a single purpose (they are not supposed to support
the so-called multiprogramming). Untested programs have traditionally hardly
ever been loaded. After the software has been tested, it could be assumed to be
reliable. This also applies to input/output. In contrast to desktop applications,
it is possibly not always necessary to implement I/O instructions as privileged
instructions and processes can sometimes be allowed to do their own I/O.
This matches nicely with the previous item and reduces the overhead of I/O
operations.


Example 4.1 Let switch correspond to the (memory-mapped) I/O address of 
some switch which needs to be checked by some program. We can simply use a 


load register,switch 


instruction to query the switch. There is no need to go through an OS service
call, which would create overhead for saving and restoring the context (registers,
etc.). ∇


However, there is a trend toward more dynamic embedded systems. Also, 
safety and security requirements might make protection necessary. Special 
memory protection units (MPUs) have been proposed for this (see Fiorin [164] 
for an example). For systems with a mix of critical and non-critical applications 
(mixed-criticality systems), configurable memory protection [351] may be a 
goal. 

• Interrupts can be connected to any thread or process. Using OS service calls,
we can request the OS to start or stop them if certain interrupts happen. We could
even store the start address of a thread or process in the interrupt vector address
table, but this technique is very dangerous, since the OS would be unaware of the
thread or process actually running. Composability may also suffer from this: if
a specific thread is directly connected to some interrupt, then it may be difficult
to add another thread which also needs to be started by some event. Application-
specific device drivers (if used) might also establish links between interrupts and
threads and processes. Techniques for establishing safe links have been studied
by Hofer et al. [218].

• Many embedded systems are real-time (RT) systems, and, hence, the OS used in
these systems must be a real-time operating system (RTOS). 
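Returning to the configurability techniques listed above, the idea behind linker-based removal of unused functions can be sketched as a reachability analysis on a call graph. The sketch below is a strong simplification: real linkers work on object-file sections and also have to consider function pointers and other indirect references. The call graph `CALL_GRAPH` and all function names in it are hypothetical.

```python
from typing import Dict, List, Set


def reachable_functions(call_graph: Dict[str, List[str]], entry: str) -> Set[str]:
    """Return all functions that can be reached from the entry point."""
    reachable: Set[str] = set()
    worklist = [entry]
    while worklist:
        func = worklist.pop()
        if func in reachable:
            continue
        reachable.add(func)
        # Follow all calls made by this function.
        worklist.extend(call_graph.get(func, []))
    return reachable


def removable_functions(call_graph: Dict[str, List[str]], entry: str) -> Set[str]:
    """Functions that are never reached and could be dropped from the image."""
    return set(call_graph) - reachable_functions(call_graph, entry)


# Hypothetical application plus library functions (caller -> callees).
CALL_GRAPH = {
    "main": ["read_sensor", "control_step"],
    "read_sensor": ["adc_read"],
    "control_step": ["pwm_write"],
    "adc_read": [],
    "pwm_write": [],
    "fs_open": ["printf"],       # file system support, unused here
    "printf": ["format_float"],  # formatted output, unused here
    "format_float": [],
}

if __name__ == "__main__":
    print(sorted(removable_functions(CALL_GRAPH, "main")))
```

In this hypothetical configuration, the file system and formatted output functions are not reachable from `main` and would therefore not have to be included in the tailored system.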


Additional information about embedded operating systems can be found in a 
book chapter written by Bertolotti [51]. This chapter comprises information about 
the architecture of embedded operating systems, the POSIX standard, open-source 
real-time operating systems, and virtualization. 
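The composability argument made above for interrupts can be illustrated with a small sketch. Instead of storing the start address of one thread directly in the interrupt vector table, the links between interrupts and threads are kept by a (toy) OS service, so that a further thread can be attached to the same interrupt later on. The class `InterruptDispatcher`, its methods, and the interrupt number used are assumptions made for this illustration; they do not correspond to the API of any particular OS.

```python
from typing import Callable, Dict, List


class InterruptDispatcher:
    """Toy stand-in for an OS service that links interrupts to threads."""

    def __init__(self) -> None:
        # Interrupt number -> list of actions starting a thread or process.
        self._links: Dict[int, List[Callable[[], None]]] = {}

    def attach(self, irq: int, start_thread: Callable[[], None]) -> None:
        """OS service call: start the given thread whenever irq happens."""
        self._links.setdefault(irq, []).append(start_thread)

    def raise_interrupt(self, irq: int) -> int:
        """Called when irq occurs; returns the number of threads started."""
        handlers = self._links.get(irq, [])
        for start_thread in handlers:
            start_thread()
        return len(handlers)


if __name__ == "__main__":
    started: List[str] = []
    dispatcher = InterruptDispatcher()
    TIMER_IRQ = 5  # hypothetical interrupt number
    dispatcher.attach(TIMER_IRQ, lambda: started.append("control_thread"))
    # A second thread can be added without touching the first link.
    dispatcher.attach(TIMER_IRQ, lambda: started.append("logging_thread"))
    dispatcher.raise_interrupt(TIMER_IRQ)
    print(started)
```

Since all links go through the dispatcher, the OS remains aware of which threads are started by which interrupt.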


4.1.2 Real-Time Operating Systems 


Definition 4.3 (A) “real-time operating system is an operating system that sup-
ports the construction of real-time systems” [523].

What is needed from an OS to be an RTOS? There are four key requirements:³


• The timing behavior of the OS must be predictable. For each service of the
OS, an upper bound on the execution time must be guaranteed. In practice, there 
are various levels of predictability. For example, there may be sets of OS service 
calls for which an upper bound is known and for which there is not a significant 
variation of the execution time. Calls like “get me the time of the day” may 
fall into this class. For other calls, there may be a huge variation. Calls like 
“get me 4MB of free memory” may fall into this second class. In particular, 
the scheduling policy of any RTOS must be deterministic. 

There may also be times during which interrupts must be disabled to avoid 
interferences between components of the OS. Less importantly, they can also 
be disabled to avoid interferences between processes. The periods during which 
interrupts are disabled must be quite short in order to avoid unpredictable delays 
in the processing of critical events. 

For RTOSs implementing file systems still using hard disks, it may be 
necessary to implement contiguous files (files stored in contiguous disk areas) 
to avoid unpredictable disk head movements. 

• The OS must manage the scheduling of threads and processes. Scheduling
can be defined as a mapping from sets of threads or processes to intervals of
execution time (including the mapping to start times as a special case) and to
processors (in case of multiprocessor systems). Also, the OS possibly has to be
aware of deadlines so that the OS can apply appropriate scheduling techniques.
There are, however, cases in which scheduling is done completely off-line and the
OS only needs to provide services to start threads or processes at specific times
or priority levels. Scheduling algorithms will be discussed in detail in Chap. 6.

• Some systems require the OS to manage time. This management is mandatory
if internal processing is linked to an absolute time in the physical environment. 
Physical time is described by real numbers. In computers, discrete time standards 
are typically used instead. The precise requirements may vary: 


1. In some systems, synchronization with global time standards is necessary. 
In this case, global clock synchronization is performed. Two standards are 
available for this: 


– Universal Time Coordinated (UTC): UTC is defined by astronomical
standards. Due to variations regarding the movement of the Earth, this
standard has to be adjusted from time to time. Several seconds have been
added during the transition from one year to the next. The adjustments
can be problematic, since incorrectly implemented software could get the
impression that the next year starts twice during the same night.

– International atomic time (in French: temps atomique international or
TAI). This standard is free of any artificial artifacts.

Some connection to the environment is used to obtain accurate time information.
External synchronization is typically based on wireless communication
standards such as the Global Positioning System (GPS) [413], mobile networks,
or special atomic time services typically based on long wavelength
stations [580], such as DCF77 in Germany.

³ This section includes information from Hiroaki Takada’s tutorial [523].

2. If embedded systems are used in a network, it is frequently sufficient to syn- 
chronize time information within the network. Local clock synchronization 
can be used for this. In this case, connected embedded systems try to agree on 
a consistent view of the current time. 

3. There may be cases in which provision for precise local delays is all that is 
needed. 


For several applications, precise time services with a high resolution must be 
provided. They are required, for example, in order to distinguish between original 
and subsequent errors. For example, they can help to identify the power plant(s) 
that are responsible for blackouts (see [427]). The precision of time services 
depends on how they are supported by a particular execution platform. They are 
very imprecise (with precisions in the millisecond range) if they are implemented 
through processes at the application level and very precise (with precisions in 
the microsecond range) if they are supported by communication hardware. More 
information about time services and clock synchronization is contained in a book 
by Kopetz [303]. 

• The OS must be fast. An operating system meeting all the requirements
mentioned so far would be useless if it were very slow. Therefore, the OS must 
obviously be fast. 
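The difference between the two global time standards mentioned above can be made concrete with a small sketch. It converts UTC time stamps to TAI with the help of an excerpt of the table of TAI − UTC offsets (34 s from 2009, 35 s from July 2012, 36 s from July 2015, and 37 s from 2017). The table is deliberately incomplete, naive `datetime` objects are simply interpreted as UTC, and the function names are chosen for this illustration only.

```python
from datetime import datetime, timedelta

# Excerpt of the TAI - UTC offset table: (valid from, offset in seconds).
# Earlier entries are omitted; the sketch rejects dates before 2009.
_TAI_MINUS_UTC = [
    (datetime(2009, 1, 1), 34),
    (datetime(2012, 7, 1), 35),
    (datetime(2015, 7, 1), 36),
    (datetime(2017, 1, 1), 37),
]


def tai_minus_utc(utc: datetime) -> int:
    """Return TAI - UTC in seconds for a UTC time stamp covered by the table."""
    if utc < _TAI_MINUS_UTC[0][0]:
        raise ValueError("table excerpt does not cover this date")
    offset = _TAI_MINUS_UTC[0][1]
    for valid_from, seconds in _TAI_MINUS_UTC:
        if utc >= valid_from:
            offset = seconds
    return offset


def utc_to_tai(utc: datetime) -> datetime:
    """Convert a UTC time stamp to the corresponding TAI time stamp."""
    return utc + timedelta(seconds=tai_minus_utc(utc))


def elapsed_seconds(utc_start: datetime, utc_end: datetime) -> float:
    """Physically elapsed time between two UTC time stamps, computed via TAI."""
    return (utc_to_tai(utc_end) - utc_to_tai(utc_start)).total_seconds()


if __name__ == "__main__":
    before = datetime(2016, 12, 31, 23, 59, 59)
    after = datetime(2017, 1, 1, 0, 0, 0)
    print((after - before).total_seconds())  # naive UTC arithmetic
    print(elapsed_seconds(before, after))    # includes the inserted second
```

Naive arithmetic on UTC time stamps yields one second for the transition from 2016 to 2017, whereas two seconds elapsed physically, since a second was inserted at the end of 2016. Software measuring durations across such a transition should therefore not rely on differences of UTC time stamps.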


Each RTOS includes a so-called real-time OS kernel. This kernel manages the 
resources which are found in every real-time system, including the processor, the 
memory, and the system timer. Major functions in the kernel include the process 
and thread management, interprocess synchronization and communication, time 
management, and memory management. 

While some RTOSs are designed for general embedded applications, others focus 
on a specific area. For example, OSEK/VDX-compatible operating systems focus on 
automotive control. Operating systems for a selected area can provide a dedicated 
service for that particular area and can be more compact than operating systems for 
several application areas. 

Similarly, while some RTOSs provide a standard API, others come with their
own, proprietary API. For example, some RTOSs are compliant with the standardized
POSIX RT-extension [201] for Unix, with the OSEK ISO 17356-3:2005
standard or with the ITRON specification developed in Japan (see
http://www.ertl.jp/ITRON/). Many RT-kernel types of OSs have their own API. ITRON,
mentioned in this context, is a mature RTOS which employs link-time configuration.

Fig. 4.3 Hybrid OSs (real-time processes 1 and 2, each with its device driver, and a
standard OS hosting non-real-time processes 1 and 2, all running on top of a
real-time kernel)


Available RTOSs can further be classified into the following categories [194]: 


• Fast proprietary kernels: According to Gupta, “for complex systems, these
kernels are inadequate, because they are designed to be fast, rather than to be 
predictable in every respect”. Examples include QNX, PDOS, VCOS, VTRX32, 
and VxWorks. 

• Real-time extensions to standard OSs: In order to take advantage of comfortable
mainstream operating systems, hybrid systems have been developed.
For such systems, there is an RT-kernel running all RT-processes. The standard 
operating system is then executed as one of these processes (see Fig. 4.3). 

This approach has some advantages: for example, the system can be equipped 
with a standard OS API and can have graphical user interfaces (GUIs) and 
file systems. Enhancements to standard OSs become quickly available in the 
embedded world as well. Also, problems with the standard OS and its non-RT- 
processes do not negatively affect the RT-processes. The standard OS can even 
crash and this would not affect the RT-processes. On the downside, and this is
already visible from Fig. 4.3, there may be problems with device drivers, since 
the standard OS will have its own device drivers. In order to avoid interference 
between the drivers for RT-processes and those for the other processes, it may 
be necessary to partition devices into those handled by RT-processes and those 
handled by the standard OS. Also, RT-processes cannot use the services of 
the standard OS. So all the nice features like file-system access and GUIs are 
normally not available to those processes, even though some attempts may be 
made to bridge the gap between the two types of processes without losing the RT 
capability. RT-Linux is an example of such hybrid OSs. 

According to Gupta [194], trying to use a version of a standard OS is “not the 
correct approach because too many basic and inappropriate underlying assump- 
tions still exist such as optimizing for the average case (rather than the worst 
case), ...ignoring most if not all semantic information, and independent CPU 
scheduling and resource allocation.” Indeed, dependencies between processes 
are not very frequent for most applications of standard operating systems and 
are therefore frequently ignored by such systems. This situation is different for 
embedded systems, since dependencies between processes are quite common 
and they should be taken into account. Unfortunately, this is not always done 
if extensions to standard operating systems are used. Furthermore, resource 
allocation and scheduling are rarely combined for standard operating systems. 
However, integrated resource allocation and scheduling algorithms are required
in order to guarantee meeting timing constraints.

• There are a number of research systems which aim at avoiding the above
limitations. These include Melody [569] and (according to Gupta [194]) MARS,
Spring, MARUTI, Arts, Hartos, and DARK.


Takada [523] mentions low overhead memory protection, temporal protection of
computing resources (aimed at preventing processes from computing for longer
periods of time than initially planned), RTOSs for on-chip multiprocessors (especially
for heterogeneous multiprocessors and multi-threaded processors), support
for continuous media, and quality of service control as research issues.

Due to the potential growth in the Internet of Things (IoT) system market,
vendors of standard OSs are offering variations of their products and are gaining
market share from traditional vendors such as Wind River Systems [591]. Due to the
increasing connectedness, Linux and its derivative Android® are becoming popular.
Advantages and limitations of using Linux in embedded systems will be described
in Sect. 4.4.


4.1.3 Virtual Machines 


In certain environments, it may be useful to emulate several processors on a 
single real processor. This is possible with virtual machines executed on the bare 
hardware. On top of such a virtual machine, several operating systems can be 
executed. Obviously, this allows several operating systems to be run on a single 
processor. For embedded systems, this approach has to be used with care since the 
temporal behavior of such an approach may be problematic and timing predictability 
may be lost. Nevertheless, sometimes this approach may be useful. For example, we 
may need to integrate several legacy applications using different operating systems 
on a single hardware processor. A full coverage of virtual machines is beyond the 
scope of this book. Interested readers should refer to books by Smith et al. [502] 
and Craig [114]. PikeOS is an example of a virtualization concept dedicated toward 
embedded systems [520]. PikeOS allows the system’s resources (e.g., memory, 
I/O devices, CPU-time) to be divided into separate subsets. PikeOS comes with a 
small micro-kernel. Several operating systems, application programming interfaces 
(APIs), and run-time environments (RTEs) can be implemented on top of this kernel 
(see Fig. 4.4). 


Fig. 4.4 PikeOS


4.2 Resource Access Protocols


In this section, we will be using the term job. 


Definition 4.4 A particular execution of a (possibly repeatedly executed) task is 
called a job. 


Compared to processes and threads used in operating systems, jobs can be seen as 
a more abstract view of required computations. During the design procedure, jobs 
will have to be mapped to entities handled by the operating system. A more precise 
definition will be provided in Definition 6.1. 


4.2.1 Priority Inversion 


There are cases in which jobs must be granted exclusive access to resources such as 
global shared variables or devices in order to avoid non-deterministic or otherwise 
unwanted program behavior. Such exclusive access is very important for embedded 
systems, e.g., for implementing shared memory-based communication or exclusive 
access to some special hardware device. Program sections during which such 
exclusive access is required are called critical sections. Critical sections should be 
short. Operating systems typically provide primitives for requesting and releasing 
exclusive access to resources, also called mutex primitives. Jobs not being granted
exclusive access must wait until the resource is released. Accordingly, the release
operation has to check for waiting jobs and resume the job of highest priority.

In this book, we will call the request operation or lock operation P(S) and the 
release or unlock operation V(S), where S corresponds to the particular resource 
requested. P(S) and V(S) are so-called semaphore operations. Semaphores allow up 
to n (with n being a parameter) threads or processes to use a particular resource 
protected by S concurrently. S is a data structure maintaining a count on how 
many resources are still available. P(S) checks the count and blocks the caller if 
all resources are in use. Otherwise, the count is modified and the caller is allowed 
to continue. V(S) increments the number of available resources and makes sure that 
a blocked caller (if it exists) is unblocked. The names P(S) and V(S) are derived 
from the Dutch language. We will use these operations only in the form of binary 
semaphores with n = 1, i.e., we will allow only a single caller to use the resource. 
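The behavior of P(S) and V(S) described above can be sketched as follows. This is a minimal illustration built on Python's `threading.Condition`, not the implementation found in any RTOS; in particular, it does not resume the waiting caller of highest priority, since Python threads have no priorities. The class name `CountingSemaphore` and the optional `timeout` parameter are assumptions made for this sketch.

```python
import threading
from typing import Optional


class CountingSemaphore:
    """Semaphore S allowing up to n callers to use a resource concurrently."""

    def __init__(self, n: int = 1) -> None:
        self._count = n                     # number of resources still available
        self._cond = threading.Condition()  # used to block and unblock callers

    def P(self, timeout: Optional[float] = None) -> bool:
        """Request operation: block while all resources are in use.

        Returns False if the optional timeout expires before the resource
        becomes available (the timeout is an addition made for testing).
        """
        with self._cond:
            if not self._cond.wait_for(lambda: self._count > 0, timeout=timeout):
                return False
            self._count -= 1
            return True

    def V(self) -> None:
        """Release operation: make one resource available, wake one waiter."""
        with self._cond:
            self._count += 1
            self._cond.notify()

    @property
    def available(self) -> int:
        with self._cond:
            return self._count


if __name__ == "__main__":
    S = CountingSemaphore(1)  # binary semaphore, n = 1
    S.P()                     # enter the critical section
    print(S.available)
    S.V()                     # leave the critical section
    print(S.available)
```

With n = 1, a second call of `P` by another thread blocks until the first caller has executed `V`, which is exactly the mutual exclusion needed for critical sections.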

For embedded systems, dependencies between processes are the rule, rather
than an exception. Also, the effective job priority of real-time applications is
more important than for non-real-time applications. Mutually exclusive access can
lead to priority inversion, an effect which changes the effective priority of processes.
Priority inversion exists on non-embedded systems as well. However, due to the
reasons just listed, the priority inversion problem can be considered a more serious
problem in embedded systems.

A first case of the consequences resulting from the combination of “mutual
exclusion” with “no preemption” can be seen in Fig. 4.5.

Bold upward pointing arrows indicate the times at which jobs are released
or “ready”. At time t0, job J2 enters a critical section after requesting exclusive
access to some resource via an operation P. At time t1, job J1 becomes ready and
preempts J2. At time t2, J1 fails to get exclusive access to the resource in use by
J2 and becomes blocked. Job J2 resumes and after some time releases the resource.
The release operation checks for pending jobs of higher priority and preempts J2.
During the time J1 has been blocked, a lower-priority job has effectively blocked a
higher-priority job. The necessity of providing exclusive access to some resources
is the main reason for this effect. Fortunately, in the particular case of Fig. 4.5, the
duration of the blocking cannot exceed the length of the critical section of J2. This
situation is problematic but difficult to avoid.

In more general cases, the situation can be even worse. This can be seen, for 
example, from Fig. 4.6. 

We assume that jobs J1, J2, and J3 are given. J1 has the highest priority, J2 has
a medium priority, and J3 has the lowest priority. Furthermore, we assume that J1
and J3 require exclusive use of some resource via operation P(S). Now, let J3 be
in its critical section when it is preempted by J2. When J1 preempts J2 and tries
to use the same resource to which J3 has exclusive access, it blocks and lets J2
continue. As long as J2 is continuing, J3 cannot release the resource. Hence, J2 is
effectively blocking J1 even though the priority of J1 is higher than that of J2. In
this example, the blocking of J1 continues as long as J2 executes. J1 is blocked by a
job of lower priority, which is not in its critical section. This effect is called priority
inversion.⁴ In fact, priority inversion happens even though J2 is unrelated to J1 and
J3. The duration of the priority inversion situation is not bounded by the length of
any critical section. This example and other examples can be simulated with the levi
simulation software [497].

A prominent case of priority inversion happened in the Mars Pathfinder, where 
exclusive use of a shared memory area led to priority inversion on Mars [276]. 


4.2.2 Priority Inheritance 


One way of dealing with priority inversion is to use the priority inheritance 
protocol (PIP). This protocol is a standard protocol available in many real-time 
operating systems. It works as follows: 


• Jobs are scheduled according to their active priorities. Jobs with the same
priorities are scheduled on a first-come, first-served basis.

• When a job J1 executes P(S) and exclusive access is already granted to some
other job J2, then J1 will become blocked. If the priority of J2 is lower than that
of J1, J2 inherits the priority of J1. Hence, J2 resumes execution. In general,
every job inherits the highest priority of jobs blocked by it.

• When a job J2 executes V(S), its priority is decreased to the highest priority of
the jobs blocked by it. If no other job is blocked by J2, its priority is reset to the
original value. The highest priority job so far blocked on S is resumed.

• Priority inheritance is transitive: if Jx blocks Jy and Jy blocks Jz, then Jx inherits
the priority of Jz.


This way, high-priority jobs being blocked by low-priority jobs propagate 
their priority to the low-priority jobs such that the low-priority jobs can release 
semaphores as soon as possible. 
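
The inheritance rules listed above can be sketched in C as a computation of the active priority of a job. The sketch is an illustration under stated assumptions and not the code of any real-time operating system: the type pip_model_t and the function pip_active_prio are hypothetical, larger numbers denote higher priorities, each job waits for at most one semaphore holder, and the blocking relation is assumed to be free of cycles (no deadlock).

```c
#include <assert.h>

#define PIP_MAX_JOBS 8
#define PIP_NOT_BLOCKED (-1)

/* Hypothetical snapshot of the blocking relation between jobs. */
typedef struct {
    int base_prio[PIP_MAX_JOBS];   /* original priority, larger = higher */
    int blocked_by[PIP_MAX_JOBS];  /* index of the job holding the needed
                                      semaphore, or PIP_NOT_BLOCKED */
    int n;                         /* number of jobs in use */
} pip_model_t;

/* Active priority of job j: its own priority or the highest active
   priority among the jobs it blocks, whichever is higher. The recursion
   covers transitive inheritance; it assumes an acyclic blocking relation. */
int pip_active_prio(const pip_model_t *m, int j) {
    assert(j >= 0 && j < m->n);
    int prio = m->base_prio[j];
    for (int k = 0; k < m->n; k++) {
        if (m->blocked_by[k] == j) {
            int inherited = pip_active_prio(m, k);
            if (inherited > prio) {
                prio = inherited;
            }
        }
    }
    return prio;
}
```

Recomputing the active priority from the current blocking relation also covers the rule for V(S): once a job no longer blocks any other job, the function returns its original priority.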

In the example of Fig. 4.6, J3 would inherit the priority of J1 when J1 executes
P(S). This would avoid the problem mentioned since J2 could not preempt J3 (see
Fig. 4.7).

Figure 4.8 shows an example of nested critical sections [81]. Note that the
priority of job J3 is not reset to its original value at time t0. Instead, its priority
is decreased to the highest priority of the jobs blocked by it; in this case, it remains
at priority p1 of J1.

Transitiveness of priority inheritance is shown in Fig. 4.9 [81]. 

At time t0, J1 is blocked by J2, which in turn is blocked by J3. Therefore, J3
inherits the priority p1 of J1.

Priority inheritance is also used by Ada: during a rendezvous, the priority of two
threads is set to their maximum. Priority inheritance also solved the Mars Pathfinder
problem: the VxWorks operating system used in the Pathfinder implements a flag
for the calls to mutex primitives. This flag allows priority inheritance to be set to
“on.” When the software was shipped, it was set to “off.” The problem on Mars
was corrected by using the debugging facilities of VxWorks to change the flag to
“on,” while the Pathfinder was already on Mars [276]. Priority inheritance can be
simulated with the levi simulation software [497].


⁴ Some authors already consider the case of Fig. 4.5 to be a case of priority inversion. This was also
done in earlier versions of this book.

Fig. 4.9 Transitiveness of priority inheritance

While priority inheritance solves some problems, it does not solve others. For 
example, there may be a large number of jobs having a high priority. There may 
also be deadlocks. The possible existence of deadlocks can be shown by means of 
an example [81]. Suppose that we have two jobs J1 and J2:


• For job J1 we assume a code sequence of the form ...; P(a); P(b); ...;
• For job J2 we assume a code sequence of the form ...; P(b); P(a); ...;


A possible execution sequence for these two jobs is shown in Fig. 4.10.

We assume that the priority of J1 is higher than that of J2. Hence, J1 preempts
J2 at time t1 and runs until it calls P(b), while b is held by J2. Hence, J2 resumes.
However, it runs into a deadlock when it calls P(a). Such a deadlock would also exist
if we were not using any resource access protocol.
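
The deadlock of this example corresponds to a cycle in the relation "waits for": J1 waits for J2 (which holds b), and J2 waits for J1 (which holds a). The following C sketch checks such a relation for cycles. It is only an illustration under stated assumptions: the function wf_has_deadlock and the constant WF_NONE are hypothetical names, and each job is assumed to wait for at most one other job.

```c
#include <assert.h>
#include <stdbool.h>

#define WF_NONE (-1)

/* waits_for[j] is the index of the job that holds the semaphore job j
   is waiting for, or WF_NONE if job j is not waiting. Returns true if
   following the waits-for links from some job leads back to that job. */
bool wf_has_deadlock(const int *waits_for, int n) {
    assert(n > 0);
    for (int start = 0; start < n; start++) {
        int j = start;
        /* A simple path can contain at most n links. */
        for (int steps = 0; steps < n && j != WF_NONE; steps++) {
            j = waits_for[j];
            if (j == start) {
                return true;
            }
        }
    }
    return false;
}
```

For the two jobs above, with J1 at index 0 and J2 at index 1, the array {1, 0} describes the situation after J2 has called P(a), and the check reports a deadlock. Before that call, the array is {1, WF_NONE}, and no deadlock is reported.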


4.2.3 Priority Ceiling Protocol 


Deadlocks can be avoided with the priority ceiling protocol [485] (PCP). PCP 
requires jobs to be known at design time. With PCP, a job is not allowed to enter 
a critical section if there are already locked semaphores which could block it 
eventually. Hence, once a job enters a critical section, it cannot be blocked by lower- 
priority jobs until its completion. This is achieved by assigning a priority ceiling. 
Each semaphore S is assigned a priority ceiling C(S). It is the static priority of the 
highest-priority job that can lock S. 
PCP works as follows: 


• Let us assume that some job J is running and wants to lock semaphore S. Then, J
can lock S only if the priority of J exceeds the priority ceiling C(S′) of semaphore
S′, where S′ is the semaphore with the highest priority ceiling among all the
semaphores which are currently locked by jobs other than J. If such a semaphore
exists and the priority of J does not exceed C(S′), then J is said to be blocked by S′
and the job currently holding S′. When J gets blocked by S′, the job currently
holding S′ inherits the priority of J.

• When some job J leaves a critical section guarded by S, it unlocks S and the
highest-priority job, if any, which is blocked by S is awakened. The priority of J
is set to the highest priority among all the jobs which are still blocked by some
semaphore which J is still holding. If J is not blocking any other job, then the
priority of J is set to its normal priority.
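
The locking rule of PCP can be sketched in C as follows. The sketch only covers the test performed when a job tries to lock a semaphore, under stated assumptions: the type pcp_model_t and the functions pcp_find_s_prime and pcp_may_lock are hypothetical names, larger numbers denote higher priorities, and jobs are identified by small integers.

```c
#include <assert.h>
#include <stdbool.h>

#define PCP_MAX_SEMS 8
#define PCP_FREE (-1)

/* Hypothetical snapshot of all semaphores in the system. */
typedef struct {
    int ceiling[PCP_MAX_SEMS]; /* C(S): priority of the highest-priority
                                  job that can lock S */
    int holder[PCP_MAX_SEMS];  /* id of the job holding S, or PCP_FREE */
    int n;                     /* number of semaphores in use */
} pcp_model_t;

/* Index of S', the semaphore with the highest ceiling among those locked
   by jobs other than 'job', or PCP_FREE if there is no such semaphore. */
int pcp_find_s_prime(const pcp_model_t *m, int job) {
    assert(m->n <= PCP_MAX_SEMS);
    int best = PCP_FREE;
    for (int s = 0; s < m->n; s++) {
        if (m->holder[s] != PCP_FREE && m->holder[s] != job &&
            (best == PCP_FREE || m->ceiling[s] > m->ceiling[best])) {
            best = s;
        }
    }
    return best;
}

/* PCP locking rule: the job may lock a semaphore if no S' exists or if
   its priority exceeds the ceiling C(S'). */
bool pcp_may_lock(const pcp_model_t *m, int job, int job_prio) {
    int s_prime = pcp_find_s_prime(m, job);
    return s_prime == PCP_FREE || job_prio > m->ceiling[s_prime];
}
```

Note that the test does not depend on which semaphore the job requests. A job can therefore be blocked although the requested semaphore itself is free, which is the property that prevents deadlocks.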


Figure 4.11 shows an example [59]. In this example, semaphores a, b, and c are
used. The priority ceiling of a and b is p1, and the priority ceiling of c is p2.

At time t2, J2 wants to lock c, but c is already locked. Furthermore, the priority 
of J2 does not exceed the ceiling of c. Nevertheless, the attempt to lock c results in 
an increase of the priority of J3 to p2.

Fig. 4.11 Locking with the priority ceiling protocol


At time t5, J1 tries to lock a. a is not yet locked, but J3 has locked b and the
current priority of J1 does not exceed the ceiling for b. So, J1 gets blocked. This is
the key property of PCP: this blocking avoids potential later deadlocks. J3 inherits
the priority of J1, reflecting that J1 is waiting for the semaphore b to be released by
J3.

At time t6, J3 unlocks b. J1 is the highest-priority job so far blocked by b and
is now awakened. The priority of J3 drops to p2. J1 locks and unlocks a and b and
runs to completion. At time t7, J2 is still blocked by c, and for all jobs with priority
p2, J3 is the only one that can be resumed. At time t8, J3 unlocks c and its priority
drops to p3. J2 is no longer blocked; it preempts J3 and locks c. J3 is only resumed
after J2 has run to completion.

Let us consider a second example, to be used later for comparison with an
extended PCP. Figure 4.12 shows this second example [59]. The priority ceiling of
all semaphores is the priority of J1. At time t2, there is a request by J3 for semaphore
c, but the priority of J3 is lower than the ceiling for the already locked semaphore a,
and J4 inherits the priority of J3. At time t3, there is a request for b, but the priority
of J2 is again lower than the ceiling of the already locked semaphore a, and J4
inherits the priority of J2. At time t5, there is a request for a, but the priority of J1
does not exceed the ceiling for a, and J4 inherits the priority of J1. When J4 releases
a, it no longer blocks any job, and its priority drops to its normal priority. At this time,
J1 has the highest priority and executes until it terminates. Remaining executions
are determined by the regular priorities.

It can be proven that PCP prevents deadlocks (see [81], Theorem 7.3). There are 
certain variants of PCP with different times at which the priority is changed. The 
Distributed Priority Ceiling Protocol (DPCP) [466] and the Multiprocessor Priority 
Ceiling Protocol (MPCP) [465] are extensions of PCP for multiprocessors.


4.2.4 Stack Resource Policy


In contrast to PCP, the stack resource policy (SRP) supports dynamic priority 
scheduling, i.e., SRP can be used with dynamic priorities as computed by EDF 
scheduling (see Sect. 6.2.1 on p. 306). For SRP, we have to distinguish between
jobs and tasks. Tasks may be describing repeating computations. Each computation 
is a job in the sense the term has been used so far. The notion of tasks captures 
features that apply to a set of jobs, e.g., the same code which needs to be executed 
periodically. Accordingly, for each task τi there is a corresponding set of jobs.
See also Definition 6.1 on p. 297. SRP does not just consider each job of a task 
separately but defines properties which apply to tasks globally. Furthermore, SRP 
supports multi-unit resources, for example, memory buffers. The following values 
are defined: 


• The preemption level li of a task τi provides information about which tasks
can be preempted by jobs of τi. A task τi can preempt some other task τj only if
li > lj. We require that, if task τi arrives after τj and τi has a higher priority, then
τi must have a higher preemption level than τj. For sporadic EDF scheduling (see
p. 316), this means that the preemption levels are ordered inversely with respect
to the relative deadlines. The larger the deadline, the easier it is to preempt the
job. li is a static value.

• The resource ceiling of a resource is the highest preemption level of the tasks that
could be blocked by issuing their maximum request for units of this resource. The
resource ceiling is a dynamic value which depends on the number of currently
available resource units.

• The system ceiling is the highest resource ceiling of all the resources which are
currently blocked. This value is dynamic and changes with resource accesses.


SRP blocks the job at the time it attempts to preempt, instead of the time at which 
it tries to lock: a job can preempt another job if it has the highest priority and its 
preemption level is higher than the system ceiling. A job is not allowed to start until
the resources currently available are sufficient to meet the maximum requirement of
every job that could preempt it. 
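
The three values defined above and the resulting preemption test can be sketched in C. This is a minimal model under stated assumptions and not a complete SRP implementation: the type srp_model_t and the functions srp_resource_ceiling, srp_system_ceiling, and srp_may_preempt are hypothetical names, preemption levels are positive integers (0 denotes "no ceiling"), and the caller is assumed to have checked already that the job has the highest priority.

```c
#include <assert.h>
#include <stdbool.h>

#define SRP_TASKS 3
#define SRP_RESOURCES 2

/* Hypothetical model of tasks and multi-unit resources. */
typedef struct {
    int level[SRP_TASKS];                      /* static preemption level l_i */
    int max_request[SRP_TASKS][SRP_RESOURCES]; /* maximum units task i requests */
    int available[SRP_RESOURCES];              /* units currently free */
} srp_model_t;

/* Resource ceiling: highest preemption level of the tasks that could be
   blocked when issuing their maximum request for resource r (0 if none). */
int srp_resource_ceiling(const srp_model_t *m, int r) {
    assert(r >= 0 && r < SRP_RESOURCES);
    int ceiling = 0;
    for (int i = 0; i < SRP_TASKS; i++) {
        if (m->max_request[i][r] > m->available[r] && m->level[i] > ceiling) {
            ceiling = m->level[i];
        }
    }
    return ceiling;
}

/* System ceiling: highest resource ceiling over all resources. */
int srp_system_ceiling(const srp_model_t *m) {
    int ceiling = 0;
    for (int r = 0; r < SRP_RESOURCES; r++) {
        int c = srp_resource_ceiling(m, r);
        if (c > ceiling) {
            ceiling = c;
        }
    }
    return ceiling;
}

/* SRP test at preemption time: the preemption level of the task must be
   higher than the system ceiling. */
bool srp_may_preempt(const srp_model_t *m, int task) {
    assert(task >= 0 && task < SRP_TASKS);
    return m->level[task] > srp_system_ceiling(m);
}
```

Since both ceilings are recomputed from the number of available units, they change with every resource access, whereas the preemption levels stay fixed.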

Figure 4.13 demonstrates the difference between PCP and SRP by means of the 
example shown in Fig. 4.12 [59]. For SRP, at time t1 there is no preemption since
the preemption level is not higher than the ceiling. The same happens at t4. Overall,
SRP has significantly fewer preemptions than PCP. This property has made SRP a
popular protocol. 

SRP is called stack resource policy, since jobs cannot be blocked by jobs with a 
lower li and can resume only when the job completes. Hence, jobs on the same level
li can share stack space. With many jobs at the same level, a substantial amount of
space can be saved. 

SRP is also free of deadlocks (see Baker [34]). For more details about SRP, 
refer also to Buttazzo [81]. PIP, PCP, and SRP protocols have been designed for 
single processors. A first overview of resource access protocols for multiprocessors 
was published by Rajkumar et al. [466]. At the time of writing this book, there is 
not yet a standard resource access protocol for multi-cores (see Baruah et al. [41], 
Chapter 23). 


4.3 ERIKA 


Several embedded systems (such as automotive systems and home appliances) 
require the entire application to be hosted on small micro-controllers.⁵ For that
reason, the operating system services provided by the firmware on such systems 
must be limited to a minimal set of features allowing multi-threaded execution of 
periodic and aperiodic jobs, with support for shared resources to avoid the priority 
inversion phenomenon. 


⁵ This section was contributed by G. Buttazzo and P. Gai (Pisa).

Such requirements were formalized in the 1990s by the OSEK/VDX
Consortium [18], which defined the minimal services of a multi-threaded real-time
operating system allowing implementations of 1–10 kilobytes of code footprint
on 8-bit micro-controllers. The OSEK/VDX API has recently been extended by
the AUTOSAR Consortium [28], which provided enhancements to support time
protection, scheduling tables for time triggered systems, and memory protection to 
protect the execution of different applications hosted on the same micro-controller. 
This section briefly describes the main features and requirements of such systems, 
considering as a reference implementation the open-source ERIKA Enterprise real- 
time kernel [157]. 

The first feature that distinguishes an OSEK kernel from other operating systems 
is that all kernel objects are statically defined at compile time. In particular, most of 
these systems do not support dynamic memory allocation and dynamic creation of 
jobs. To help the user in configuring the system, the OSEK/VDX standard provides a
configuration language, named OIL, to specify the objects that must be instantiated 
in the application. When the application is compiled, the OIL compiler generates the 
operating system data structures, allocating the exact amount of memory needed. 
This approach allows allocating only the data really needed by the application, to 
be put in flash memory (which is less expensive than RAM memory on most micro- 
controllers). 

The second feature distinguishing an OSEK/VDX system is the support for 
stack sharing. The reason for providing stack sharing is that RAM memory is 
very expensive on small micro-controllers. The possibility of implementing a stack 
sharing system is related to how the code is written. 

In traditional real-time systems, we consider the repetitive execution of code. A 
job corresponds to a single execution of the code. The code to be executed repeatedly 
is called a task. In particular, tasks may be periodically causing the execution of a 
job. The typical implementation of such a periodic task is structured according to 
the following scheme: 


task(x) {
  int local;
  initialization();
  for (;;) {
    do_instance();
    end_instance();
  }
}


Such a scheme is characterized by a forever loop containing an instance (job) of 
the periodic task that terminates with a blocking primitive (end_instance()), which 
has the effect of blocking the task until the next activation. When following such 
a programming scheme (called extended task in OSEK/VDX), the task is always 
present in the stack, even during waiting times. In this case, the stack cannot be 
shared, and a separate stack space must be allocated for each task.

The OSEK/VDX standard also provides support for basic tasks, which are
special tasks that are implemented in a way more similar to functions, according 
to the following scheme: 


int local;
task x() {
  do_instance();
}
System_initialization() {
  initialization();
}


Compared to extended tasks, the persistent state of basic tasks that must be
maintained between different instances is not stored on the stack, but in global
variables. Also, the initialization part is moved to system initialization, because
tasks are not dynamically created but exist from system start. Finally, no
synchronization primitive is needed to block the task until its next period, because
the task is activated every time a new instance starts. Also, the task cannot call
any blocking primitive; therefore it can either be preempted by higher-priority tasks
or execute until completion. In this way, the task behaves like a function, which
allocates a frame on the stack, runs, and then cleans the frame. For this reason, the
task does not occupy stack space between two executions, allowing the stack to be
shared among all tasks in the system. ERIKA Enterprise supports stack sharing,
allowing all basic tasks in the system to share a single stack, thus reducing the overall
RAM used for this purpose.
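
The run-to-completion behavior of basic tasks can be illustrated with a small C sketch. It is not the OSEK/VDX or ERIKA Enterprise API: all names (bt_activate, bt_dispatch, bt_task0, bt_task1, and the arrays) are hypothetical, task 0 is assumed to have the highest priority, and preemption by interrupts is not modeled. The sketch shows that persistent state is kept in global variables and that each instance is an ordinary function call whose stack frame disappears when the instance returns.

```c
#include <assert.h>

#define BT_TASKS 2

/* Persistent state lives in global variables, not on a task stack. */
int bt_counter[BT_TASKS];

/* Activations that have been requested but not yet executed. */
int bt_pending[BT_TASKS];

/* Each function body corresponds to one instance (job) of a basic task. */
void bt_task0(void) { bt_counter[0] += 1; }
void bt_task1(void) { bt_counter[1] += 10; }

typedef void (*bt_body_t)(void);
const bt_body_t bt_body[BT_TASKS] = { bt_task0, bt_task1 };

/* Record one more activation of the given task. */
void bt_activate(int task) {
    assert(task >= 0 && task < BT_TASKS);
    bt_pending[task]++;
}

/* Run all pending instances, highest priority (index 0) first. Each
   instance is a plain function call on the caller's stack. Returns the
   number of instances executed. */
int bt_dispatch(void) {
    int executed = 0;
    int t = 0;
    while (t < BT_TASKS) {
        if (bt_pending[t] > 0) {
            bt_pending[t]--;
            bt_body[t]();
            executed++;
            t = 0;   /* re-check higher priorities after each instance */
        } else {
            t++;
        }
    }
    return executed;
}
```

Because no instance keeps a frame on the stack between two executions, the stack depth needed by this dispatcher is determined by the deepest single instance rather than by the sum over all tasks.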

Concerning task management, OSEK/VDX kernels provide support for fixed 
priority scheduling with Immediate Priority Ceiling to avoid the priority inversion 
problem. The usage of Immediate Priority Ceiling is supported through the speci- 
fication of the resource usage of each task in the OIL configuration file. The OIL 
compiler computes the resource ceiling of each task based on the resource usage 
declared by each task in the OIL file. 
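
The ceiling computation can be sketched in C as follows. The sketch is a simplified illustration and not the output of an OIL compiler: the type ipc_config_t and the functions ipc_ceiling and ipc_prio_while_holding are hypothetical names, larger numbers denote higher priorities, and the resource usage is given as a static table in the spirit of the declarations in an OIL file.

```c
#include <assert.h>
#include <stdbool.h>

#define IPC_TASKS 3
#define IPC_RESOURCES 2

/* Hypothetical static configuration, fixed at compile time. */
typedef struct {
    int prio[IPC_TASKS];                 /* fixed task priorities */
    bool uses[IPC_TASKS][IPC_RESOURCES]; /* declared resource usage */
} ipc_config_t;

/* Ceiling of resource r: highest priority among the tasks that declare
   the use of r (0 if no task uses it). */
int ipc_ceiling(const ipc_config_t *c, int r) {
    assert(r >= 0 && r < IPC_RESOURCES);
    int ceiling = 0;
    for (int t = 0; t < IPC_TASKS; t++) {
        if (c->uses[t][r] && c->prio[t] > ceiling) {
            ceiling = c->prio[t];
        }
    }
    return ceiling;
}

/* Immediate Priority Ceiling: as soon as a task locks r, it runs at the
   ceiling of r (or at its own priority, if that is higher). */
int ipc_prio_while_holding(const ipc_config_t *c, int task, int r) {
    assert(task >= 0 && task < IPC_TASKS);
    int ceiling = ipc_ceiling(c, r);
    return ceiling > c->prio[task] ? ceiling : c->prio[task];
}
```

Since all inputs are known statically, the ceilings can be computed once at build time, which fits the static configuration approach described above.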

OSEK/VDX systems also support non-preemptive scheduling and preemption 
thresholds to limit the overall stack usage. The main idea is that limiting the 
preemption between tasks reduces the number of tasks allocated on the system stack 
at the same time, further reducing the overall amount of required RAM. Note that 
reducing preemptions may degrade the schedulability of the task set; hence the
degree of preemption must be traded off with the system schedulability and the 
overall RAM memory used in the system. 

Another requirement for operating systems designed for small micro-controllers 
is scalability, which means supporting reduced versions of the API for smaller 
footprint implementations. In mass production systems, in fact, the footprint 
significantly impacts the overall cost. In this context, scalability is provided
through the concept of conformance classes, which define specific subsets of the
operating system API. Conformance classes are also accompanied by an upgrade
path between them, with the final objective of supporting partial implementation
of the standard with reduced footprint. The conformance classes supported by the
OSEK/VDX standard (and by ERIKA Enterprise) are:


• BCC1: this is the smallest conformance class, supporting a minimum of eight
tasks with different priority and one shared resource.

• BCC2: compared to BCC1, this conformance class adds the possibility to have
more than one task at the same priority. Each task can have pending activations,
that is, the operating system records the number of instances that have been
activated but not yet executed.

• ECC1: compared to BCC1, this conformance class adds the possibility to have
extended tasks that can wait for an event to appear.

• ECC2: this conformance class adds both multiple activations and extended tasks.


ERIKA Enterprise further extends these conformance classes by providing the 
following two conformance classes: 


• EDF: this conformance class does not use a fixed priority scheduler but an
Earliest Deadline First (EDF) scheduler (see Sect. 6.2.1) optimized for the
implementation on small micro-controllers.

• FRSH: this conformance class extends the EDF scheduler class by providing a
resource reservation scheduler based on the IRIS scheduling algorithm [380].


Another interesting feature of OSEK/VDX systems is that the system provides 
an API for controlling interrupts. This is a major difference when compared to 
POSIX-like systems, where interrupts are an exclusive domain of the operating 
system and are not exported to the operating system API. The rationale for this 
is that on small micro-controllers users often want to directly control interrupt 
priorities; hence it is important to provide a standard way to deal with interrupt 
disabling/enabling. Moreover, the OSEK/VDX standard specifies two types of 
Interrupt Service Routines (ISR): 


• Category 1: simpler and faster, does not implement a call to the scheduler at the
end of the ISR.

• Category 2: this ISR can call some primitives that change the scheduling
behavior. The end of the ISR is a rescheduling point. ISR1 always has a higher
priority than ISR2.


An important feature of OSEK/VDX kernels is the possibility to fine-tune the 
footprint by removing error-checking code from the production versions, as well as 
to define hooks that will be called by the system when specific events occur. These 
features allow for a fine-tuning of the application footprint that will be larger (and
safer) when debugging and smaller in production, once most bugs have been found and
removed from the code.

To support a better debugging experience, the OSEK/VDX standard defines a 
textual language, named ORTI, which describes where the various objects of the 
operating system are allocated. The ORTI file is typically generated by the OIL 
compiler and is used by debuggers to print detailed information about operating
system objects defined in the system (e.g., the debugger could print the list of the
tasks in an application with their current status). 

All the features defined by the OSEK/VDX standard have been implemented 
in the open-source ERIKA Enterprise kernel [157], for a set of embedded micro- 
controllers, with a final footprint ranging between 1 and 5 kilobytes of object code. 
ERIKA Enterprise also implements additional features, like the EDF scheduler, 
providing an open and free-of-charge operating system that can be used to learn, 
test, and implement real applications for industrial and educational purposes. 


4.4 Embedded Linux 


Increasing requirements on the functionality of embedded systems, such as Internet
connectivity (in particular for the Internet of Things) or sophisticated graphics 
displays, demand that a large amount of software is added to a typical embedded 
system’s simple operating system. It has been shown that it is possible to add 
some of this functionality to small embedded real-time operating systems, e.g., by 
integrating a small Internet protocol (IP) network stack [142]. However, integrating 
a number of different additional software components is a complex task and may 
lead to functional as well as security deficiencies. 

A different approach, enabled by the exponential growth of semiconductor 
densities according to Moore’s law, is the adaptation of a well-tested code base 
with the required functionality to run in an embedded context. Here, Linux® has 
become the OS of choice for a large number of complex embedded applications 
following this approach, such as Internet routers, GPS satellite navigation systems, 
network-attached storage devices, smart television sets, and mobile phones. These 
applications benefit from easy portability—Linux has been ported to more than 
30 processor architectures, including the popular embedded ARM, MIPS, and 
PowerPC architectures—as well as the system’s open-source nature, which avoids 
the licensing costs arising for commercial embedded operating systems. 

Adapting Linux to typical embedded environments poses a number of challenges 
due to its original design as a server and desktop OS. Below, we detail solutions 
available in Linux to tackle the most common problems that arise in its use in 
embedded systems. 


4.4.1 Embedded Linux Structure and Size 


Strictly speaking, the term “Linux” denotes only the kernel of a Linux-based
operating system. To create a complete, working operating system, a number of additional
components are required that run on top of the Linux kernel. A configuration for a
typical Linux system, including system-level user mode components, is shown in
Fig. 4.14. On top of the Linux kernel reside a number of libraries (commonly
dynamically linked), which form the basis for system-level tools and applications.
Device drivers in Linux are usually implemented as loadable kernel modules;
however, restricted user mode access to hardware is also possible.


This section on Embedded Linux was contributed by M. Engel (NTNU Trondheim).

Fig. 4.14 Structure of typical Linux-based system

The open-source nature of Linux allows tailoring the kernel and other system
components to the requirements of a given application and platform. This, in turn, 
results in a small system which enables the use of Linux in systems with restricted 
memory sizes. 

One of the essential components of a Unix-like system is the C library, which 
provides basic functionality for file I/O, process synchronization and communi- 
cation, string handling, arithmetic operations, and memory management. The libc 
variant commonly used in Linux-based systems is GNU libc (glibc). However, glibc 
was designed with server and desktop systems in mind and, thus, provides much 
more functionality than typically required in embedded applications. Linux-based 
Android® systems replace glibc with Bionic, a libc version derived from BSD 
Unix. Bionic is specifically designed to support systems running at lower clock 
speeds, e.g., by providing a tailored version of the Pthreads multithreading library 
to efficiently support Android’s Dalvik Java VM. Bionic’s size is estimated to be 
about half the size of a typical glibc version.⁷

Several significantly smaller implementations of libc exist, such as newlib, musl,
uClibc, PDCLib, and dietlibc. Each of these is optimized for a specific use case; e.g.,
musl is optimized for static linking, uClibc was originally designed for MMU-less⁸
Linux systems (see below), whereas newlib is a cross-platform libc also available for
a number of other OS platforms. Sizes of the related shared library binary files range
from 185 kB (dietlibc) to 560 kB (uClibc), whereas the glibc binary is 7.9 MB in size
(all numbers taken from x86 binaries) according to a comprehensive comparison of
different libc implementation features and sizes, compiled by Eta Labs.⁹ Figure 4.15
gives an overview of the sizes of various libc variants and programs built using the
different libraries.


⁷ The glibc shared library size includes internationalization support.

libc version                          musl     uClibc   dietlibc   glibc
Static library size                   426 kB   500 kB   120 kB     2.0 MB
Shared library size                   527 kB   560 kB   185 kB     7.9 MB
Minimal static C program size         1.8 kB   5 kB     0.2 kB     662 kB
Minimal static “Hello, World” size    13 kB    70 kB    6 kB       662 kB


Fig. 4.15 Size comparison of different Linux libc configurations
In addition to the C library, the functionality, size, and number of utility programs 
bundled with the OS can be adapted according to application requirements. These 
utilities are required in a Linux system to control system startup, operation, 
and monitoring; examples are tools to mount file systems, to configure network 
interfaces, or to copy files. As is the case for glibc, a typical Linux system includes 
a set of tools appropriate for a large number of use cases, most of which are not 
required on an embedded system. 

An alternative to a traditional set of diverse tools is BusyBox, a software package that
provides a number of simplified essential Unix utilities in a single executable 
file. It was specifically created for embedded operating systems with very limited 
resources. BusyBox reduces the overhead introduced by the executable file format 
and allows code to be shared between multiple applications without requiring a 
library. A comparison of BusyBox with alternative approaches to provide a small 
user mode tool set can be found in [531]. 


4.4.2 Real-Time Properties 


Achieving real-time guarantees in a system based on a general-purpose operating
system kernel is one of the most complex challenges in adapting an OS to run
in an embedded context. As shown above in Fig. 4.3, one common approach is
to run the Linux kernel and all Linux user mode processes as a dedicated task
of an underlying RTOS, only to be activated when no real-time task needs to
run. In Linux, competing approaches exist that follow this design pattern. RTAI
(real-time application interface) [138] is based on the Adeos hypervisor,¹⁰ which
is implemented as a Linux kernel extension. Adeos enables multiple prioritized
domains (one of which is the Linux kernel itself) to exist simultaneously on the
same hardware. On top of this, RTAI provides a service API, for example, to
control interrupts and system timers. Xenomai [182] was co-developed with RTAI
for several years but became an independent project in 2005. It is based on its own
abstract “nucleus” RTOS core, which provides real-time scheduling, timer, memory
allocation, and virtual file handling services. Both projects differ in their aims and
implementations. However, they share the support for the Real-Time Driver Model
(RTDM), a method to unify interfaces for developing device drivers and related
applications in real-time Linux systems. The third approach using an underlying
real-time kernel is RTLinux [608], developed as a project at the New Mexico
Institute of Mining and Technology and then commercialized at the company
FSMLabs, which was acquired by Wind River in 2007. The related product was
discontinued in 2011. The use of RTLinux in products was controversial, since its
initiators vigorously defended their intellectual property, for which they obtained a
software patent [607]. The decision to patent the RTLinux methods was not well
received by the Linux developer community, leading to spin-offs resulting in the
above-mentioned RTAI and Xenomai projects.


⁸ See Appendix C for an introduction to MMUs.
⁹ Available online at http://www.etalabs.net/compare_libcs.html.

A more recent approach to add real-time capabilities to Linux, integrated into the 
kernel as of version 3.14 (2014), is SCHED_DEADLINE, a CPU scheduling policy 
based on the Earliest Deadline First (EDF) and Constant Bandwidth Server (CBS) 
[3] algorithms and supporting resource reservations. The SCHED_DEADLINE policy 
is designed to co-exist with other Linux scheduling policies. However, it takes 
precedence over all other policies to guarantee real-time properties.

Each task $\tau_i$ scheduled under SCHED_DEADLINE is associated with a runtime
budget $C_i$ and a period $T_i$, indicating to the kernel that $C_i$ time units are required
by that task every $T_i$ time units, on any processor. For real-time applications,
$T_i$ corresponds to the minimum time elapsing between subsequent activations
(releases) of the task, and $C_i$ corresponds to the worst case execution time needed
by each execution of the task. On addition of a new task to this scheduling policy,
a schedulability test is performed and the task is only accepted if the test succeeds.
During scheduling, a task is suspended when it tries to run for longer than the
pre-allocated budget and deferred to its next execution period. This
non-work-conserving strategy¹¹ is required to guarantee temporal isolation between different
tasks. Thus, on single-processor or partitioned multi-processor systems (with tasks
pinned to a specific CPU), all accepted SCHED_DEADLINE tasks are guaranteed to
be scheduled for an overall time equal to their budget in every time window as long
as their period.
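
The admission step described above can be illustrated with a utilization-based test. The following C sketch is a simplified illustration and not the kernel's actual implementation: it assumes a single processor, deadlines equal to periods, and a caller-supplied utilization bound (1.0 for ideal EDF). The names dl_task, dl_utilization, and dl_admit are hypothetical.

```c
#include <stddef.h>

/* Hypothetical task descriptor: runtime budget C_i and period T_i,
 * both expressed in the same time unit. */
struct dl_task {
    double runtime;  /* C_i */
    double period;   /* T_i, assumed to be > 0 for accepted tasks */
};

/* Total utilization sum(C_i / T_i) of the tasks accepted so far. */
double dl_utilization(const struct dl_task *tasks, size_t n)
{
    double u = 0.0;
    for (size_t i = 0; i < n; i++)
        u += tasks[i].runtime / tasks[i].period;
    return u;
}

/* Returns 1 if the candidate can be accepted without pushing the total
 * utilization above the bound, 0 otherwise. */
int dl_admit(const struct dl_task *tasks, size_t n,
             const struct dl_task *candidate, double bound)
{
    /* Reject malformed parameters before computing anything. */
    if (candidate->period <= 0.0 || candidate->runtime < 0.0 ||
        candidate->runtime > candidate->period)
        return 0;

    double u = dl_utilization(tasks, n) +
               candidate->runtime / candidate->period;
    return u <= bound;
}
```

With two accepted tasks of utilization 0.25 each, a candidate with $C_i/T_i = 0.25$ would be accepted under a bound of 1.0, whereas a candidate with $C_i/T_i = 0.75$ would be rejected. Choosing a bound below 1.0 would leave processor time for tasks of other scheduling policies.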


10 See http://home.gna.org/adeos/.


11 This means that the processor may be idle even when tasks could be executed. A definition of
the term can be found in Chap. 6 on p. 309.


Fig. 4.16 Structure of the JFFS2 inode content


In the general case of tasks which are free to migrate on a multi-processor,
SCHED_DEADLINE implements global EDF (described in detail in Sect. 6.3.3), so
the general tardiness bound for global EDF applies [128]. Benchmarks reported
in [336] show fewer than 0.2% missed deadlines when running
SCHED_DEADLINE on a four-processor system with a utilization of 380% and
0.615% with a utilization of 390%. The numbers cited for a six-processor system
are of similar magnitude. Of course, no deadline misses occur on single-processor
systems or multi-core systems with processes pinned to a fixed processor core.


4.4.3 Flash Memory File Systems 


Embedded systems place different requirements on permanent storage than server
or desktop environments. Often, there is a large amount of static (read-only) data,
whereas the amount of varying data is in many cases quite limited.

File system storage can benefit from these special conditions. Since
most of the read-only data in current embedded SoCs is stored in flash ROM,
optimization for this storage is an important aspect of the use of Linux in embedded
systems. Accordingly, a number of different file systems specifically designed for
NAND-based flash storage have been developed.

One of the most stable flash-specific file systems available is the log-structured 
Journaling Flash File System version 2 (JFFS2) [596]. In JFFS2, changes to files and 
directories are “logged” to flash memory in so-called nodes. Two types of nodes 
exist, inodes (shown in Fig. 4.16), which consist of a header with file metadata
followed by an optional payload of file data, and dirent nodes, which are directory 
entries each holding a name and an inode number. Nodes start out as valid when 
they are created and become obsolete when a newer version has been created in a 
different place in flash memory. JFFS2 supports transparent data compression by 
storing compressed data as inode payloads. 

However, in contrast to other log-structured file systems such as Berkeley LFS
[473], there is no circular log. Instead, JFFS2 uses blocks, a unit the same size as
the erase segment of the flash medium. Blocks are filled with nodes in a bottom-up 
manner one at a time, as shown in Fig. 4.17. 

Fig. 4.17 Changes to flash when writing data to JFFS2


Clean blocks contain only valid nodes, whereas dirty blocks contain at least one
obsolete node. In order to reclaim memory, a background garbage collector collects
dirty blocks and frees them. Valid nodes from dirty blocks are copied into a new
block, whereas obsolete nodes are skipped. After copying, the dirty block is marked
as free. The garbage collector is also able to consume clean blocks in order to even
out flash memory wear (wear leveling) and prevent localized erasure of blocks in a
mostly static file system, as is common in many embedded systems.
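
The garbage collection step just described can be sketched in C. The model below is a deliberate simplification and not the JFFS2 on-flash format: it assumes a fixed number of equally sized node slots per block, and all type and function names (node, block, block_is_dirty, gc_collect) are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define NODES_PER_BLOCK 8   /* assumption: fixed slot count per erase block */

enum node_state { NODE_EMPTY = 0, NODE_VALID, NODE_OBSOLETE };

struct node {
    enum node_state state;
    unsigned inode;     /* inode number the node belongs to */
    unsigned version;   /* a newer version makes older nodes obsolete */
};

struct block {
    struct node nodes[NODES_PER_BLOCK];
    size_t used;        /* number of slots filled so far, bottom-up */
};

/* A block is dirty if it contains at least one obsolete node. */
int block_is_dirty(const struct block *b)
{
    for (size_t i = 0; i < b->used; i++)
        if (b->nodes[i].state == NODE_OBSOLETE)
            return 1;
    return 0;
}

/* Copies the valid nodes of src to the end of dst, skips obsolete nodes,
 * and then marks src as free. Returns the number of nodes copied, or -1
 * if dst does not have enough free slots (src is left untouched then). */
int gc_collect(struct block *src, struct block *dst)
{
    size_t valid = 0;
    for (size_t i = 0; i < src->used; i++)
        if (src->nodes[i].state == NODE_VALID)
            valid++;
    if (dst->used + valid > NODES_PER_BLOCK)
        return -1;

    for (size_t i = 0; i < src->used; i++)
        if (src->nodes[i].state == NODE_VALID)
            dst->nodes[dst->used++] = src->nodes[i];

    /* Stand-in for erasing the flash block: all slots become empty. */
    memset(src, 0, sizeof *src);
    return (int)valid;
}
```

Applying gc_collect to a clean block as well simply moves all of its nodes, which corresponds to the wear-leveling use mentioned above.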


4.4.4 Reducing RAM Usage 


Traditionally, Unix-like operating systems treat main memory (RAM) as a cache 
for secondary storage on disk, i.e., swap space [385]. While this is a useful 
assumption for desktop and server systems with large disks and equally large 
memory requirements, it results in a waste of resources for embedded systems, since 
programs which exist in a system’s non-volatile memory have to be loaded into 
volatile memory for execution. This commonly includes the rather large operating 
system kernel. 

To eliminate this duplication of memory requirements, a number of execute-in-place
(XiP) techniques have been developed which allow the direct execution of
program code from flash memory, which is the common approach in most smaller,
microcontroller-based systems. However, XiP techniques face two challenges. On
the one hand, the non-volatile memory storing the executable code needs to support
accesses in byte or word granularity. On the other hand, executable programs are
commonly stored in a data format such as ELF, which contains meta information
(e.g., symbols for debugging) and needs to be linked at runtime before execution.

Support for XiP techniques is commonly implemented as a special file system, 
such as the Advanced XiP Filesystem (AXFS) [43], which provides compressed 
read-only functionality. The use of XiP is especially useful for the kernel itself, 
which would normally consume a large part of non-swappable memory. Running the 
kernel from flash memory would make more memory available for user-space code. 
XiP for user mode code itself is less useful, since the kernel only loads required text 
pages of an executable in virtual memory-enabled systems. Thus, RAM usage for 
program code is automatically minimized. 

Providing the byte- or word-granularity accesses required for XiP is mostly a 
question of cost in current systems. The commonly used NAND flash technology, 
as used in flash disks, SD cards, and SSDs, is inexpensive but only allows block- 
level accesses, similar to hard disks. NOR flash is a flash technique supporting 
random accesses; thus it is suitable for implementing XiP techniques. However, 
NOR flash tends to be an order of magnitude more expensive than NAND flash and 
is commonly somewhat slower than system RAM. As a consequence, equipping a 
system with more RAM instead of a large NOR flash and not using XiP techniques 
is a sensible design choice for most systems. 


4.4.5 uClinux: Linux for MMU-Less Systems 


One final resource restriction is apparent in low-end microcontroller systems, such 
as ARM’s Cortex-M series. The processor cores in these SoCs were developed for 
typical real-time OS scenarios, which often use a simple library OS approach, as 
described for ERIKA above. Thus, they lack crucial OS support hardware such as a 
paging memory management unit (see Appendix C). However, the large address 
space and relatively high clock speeds of these microcontrollers enable running 
a Linux-like operating system with some restrictions. Thus, uClinux was created 
as a derivative of the Linux kernel for MMU-less systems. Since kernel version 
2.5.46, uClinux support is available in the mainstream kernel source tree for a 
number of architectures including ARM7TDMI, ARM Cortex-M3/4/7/R, MIPS, 
M68k/ColdFire, as well as FPGA-based softcores such as Altera Nios II, Xilinx 
MicroBlaze, and Lattice Mico32. 

The lack of memory management hardware in uClinux-supported platforms 
comes with a number of disadvantages. An obvious drawback is the lack of memory 
protection, so any process is able to read and write other processes’ memory. The 
lack of an MMU also has consequences for the traditional Unix process creation 
approach. Commonly, processes in Unix are created as a copy of an existing process
using the fork() system call [470]. On systems with an MMU, instead of creating a
physical copy in memory, which would require copying potentially large amounts of
data, only the page table entries of the process executing fork() are replicated and
point to physical page
frames of the parent process. When the newly created process memory starts to 
differ from its parent due to data writes, only the affected page frames are copied on 
demand using a copy-on-write strategy. The lack of hardware support for copy-on- 
write semantics and the overhead involved in actually copying pages result in the 
fork() system call being unavailable in uClinux. 

Instead, uClinux provides the vfork() system call. This system call makes use of 
the fact that most Unix-style processes immediately call exec() after a fork to start 
a different executable file by overloading their memory image with text and data 
segments of that different binary: 


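#include <stdio.h>
#include <unistd.h>

int main(void) {
    pid_t childPID = vfork();
    if (childPID == 0) {   // in child process
        execl("/bin/sh", "sh", (char *)NULL);
        _exit(127);        // reached only if execl() failed
    }
    printf("Parent program running again, child PID is %d\n", (int)childPID);
    return 0;
}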


Calling exec() directly after vfork() implies that the complete address space
of the newly created process will be replaced in any case and only a small part of the
executable calling vfork() is actually used. In contrast to standard Unix behavior, vfork()
guarantees that the parent process is stopped after forking until the child process
has called the exec() system call or has terminated. Thus, the parent process is unable
to interfere with the execution of the child process until the new program image has been
loaded. However, some restrictions have to be observed to guarantee safe operation
of vfork(). It is not permitted to modify the stack in the created child process, i.e., no
function calls may be executed before exec(). As a consequence, the child must not
return from the function that called vfork() in case of an error, e.g., insufficient memory
or inability to execute the new program, since this would modify the stack. Instead, it is
recommended to call _exit() from the child process in case of a problem (rather than
exit(), which would flush I/O buffers shared with the parent).

To summarize, uClinux is a way to use some Linux functionality on low-end,
microcontroller-style embedded systems. However, the on-chip memory even
in high-end microcontrollers is restricted to several hundred kB, whereas a minimal
uClinux version requires about 8 MB of RAM, so the addition of an external
RAM chip is essential. For systems with less memory, more
traditional RTOSes are still the more feasible solution.


4.4.6 Evaluating the Use of Linux in Embedded Systems 


In addition to technical criteria, the decision whether to base an embedded system 
on Linux also has to consider legal and business questions. 
On the technical side, Linux includes support for a large number of CPU
architectures, SoCs, and peripheral devices as well as communication protocols
commonly used in embedded applications, such as the Internet protocol suite (TCP/IP),
CAN, Bluetooth® or IEEE 802.15.4/ZigBee®. It provides a POSIX-like API that
enables easy porting of existing code, not only written in C or C++ but also in 
scripting languages such as Python or Lua and even more specialized languages 
like Erlang. Linux development tools are available free of charge and can easily be 
integrated into development toolflows utilizing IDEs such as Eclipse and continuous 
integration testing services such as Jenkins. While in general, the Linux code base is 
well tested, the quality of support varies with the targeted platform. When utilizing 
a less common hardware platform, it is recommended to thoroughly investigate the 
stability of CPU and driver support. One drawback of using Linux is the inherent 
complexity of the large code base, requiring a good insight into and experience with 
the system to debug problems. However, a number of semiconductor manufacturers 
and third-party companies offer commercial support for embedded Linux, including 
the provisioning of complete board support packages (BSPs) for a number of 
reference designs. 

From a business perspective, the obvious benefit of using Linux is the availability
of its source code free of cost. However, the GNU General Public License (GPL)
version 2¹² governing
the kernel source code also requires that the source code for modifications to the
existing code base is provided along with the binary code. This might jeopardize
trade secrets of hardware components or violate non-disclosure agreements with
hardware intellectual property owners. For some hardware, such as GPUs,
this is circumvented by the inclusion of binary code “blobs” which are loaded
by an open-source device driver stub. However, this approach is actively
discouraged by the Linux kernel developers.

An increasingly serious problem is the security of embedded systems built on 
Linux, especially in the context of the Internet of Things. Many security problems 
affecting the Linux kernel also apply to embedded Linux. Inexpensive consumer
devices, such as Internet-connected cameras, routers, and mobile phones, rarely receive
software updates but may be in active use for many years. This exposes them
to security vulnerabilities which are already being actively exploited, e.g., for
distributed denial-of-service (DDoS) attacks emanating from thousands of hijacked
embedded Linux devices. As a consequence, the cost of continually updating
devices in production as well as legacy devices in the field has to be considered 
in order to provide secure systems. 


12 See http://www.gnu.org/licenses/gpl-2.0.html.


4.5 Hardware Abstraction Layer


Hardware abstraction layers (HALs) provide a way for accessing hardware through 
a hardware-independent application programming interface (API). For example, 
we could come up with a hardware-independent technique for accessing timers, 
irrespective of the addresses to which timers are mapped. Hardware abstraction 
layers are used mostly between the hardware and operating system layers. They 
provide software intellectual property (IP), but they are neither part of operating 
systems nor can they be classified as middleware. A survey of work in this area is
provided by Ecker, Müller, and Dömer [145].
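
The timer example mentioned above could be sketched in C as follows. This is only an illustration under our own assumptions, not an interface taken from any particular HAL: the names hal_timer, hal_timer_ops, and the simulated back end sim_timer are hypothetical, and the simulated back end merely stands in for a memory-mapped timer peripheral.

```c
#include <stdint.h>

/* Hypothetical hardware-independent timer interface: the application
 * only sees these operations, never register addresses. */
struct hal_timer_ops {
    void     (*start)(void *hw, uint32_t period_ticks);
    void     (*stop)(void *hw);
    uint32_t (*read)(void *hw);
};

struct hal_timer {
    const struct hal_timer_ops *ops;
    void *hw;   /* platform-specific state, e.g., a register base address */
};

void hal_timer_start(struct hal_timer *t, uint32_t period_ticks)
{
    t->ops->start(t->hw, period_ticks);
}

void hal_timer_stop(struct hal_timer *t)
{
    t->ops->stop(t->hw);
}

uint32_t hal_timer_read(struct hal_timer *t)
{
    return t->ops->read(t->hw);
}

/* Simulated back end (assumption): a counter that wraps at its period. */
struct sim_timer {
    uint32_t period;
    uint32_t count;
    int running;
};

static void sim_start(void *hw, uint32_t period_ticks)
{
    struct sim_timer *s = hw;
    s->period = period_ticks;
    s->count = 0;
    s->running = 1;
}

static void sim_stop(void *hw)
{
    ((struct sim_timer *)hw)->running = 0;
}

static uint32_t sim_read(void *hw)
{
    return ((struct sim_timer *)hw)->count;
}

/* Advances the simulated hardware by one tick. */
void sim_tick(struct sim_timer *s)
{
    if (s->running && ++s->count >= s->period)
        s->count = 0;
}

const struct hal_timer_ops sim_timer_ops = { sim_start, sim_stop, sim_read };
```

Porting the application to a different timer peripheral would then only require a different hal_timer_ops table, while code calling hal_timer_start() and hal_timer_read() would remain unchanged.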


4.6 Middleware 


Communication libraries provide a means for adding communication functionality 
to languages lacking this feature. They add communication functionality on top of 
the basic functionality provided by operating systems. Due to being added on top 
of the OS, they can be independent of the OS (and obviously also of the underlying 
processor hardware). As a result, we will obtain communication-oriented cyber- 
physical systems. Such communication is needed for the Internet of Things (IoT). 
There is a trend toward supporting communication within some local system as well 
as communication over longer distances. The use of Internet protocols in general is 
becoming more popular. Frequently, such protocols enable secure communication, 
based on en- and decryption (see p. 196). The corresponding algorithms are a 
special case of middleware. 


4.6.1 OSEK/VDX COM 


OSEK/VDX® COM is a special communication standard for the OSEK automotive
operating systems [441].¹³ OSEK COM provides an “Interaction Layer” as an
application programming interface (API) through which internal communication 
(communication within one ECU) and external communication (communication 
with other ECUs) can be performed. OSEK COM specifies just the functionality of 
the Interaction Layer. Conforming implementations must be developed separately. 
The Interaction Layer communicates with other ECUs via a “Network Layer” 
and a “Data Link” layer. Some requirements for these layers are specified by 
OSEK COM, but these layers themselves are not part of OSEK COM. This way, 
communication can be implemented on top of different network protocols. 


13 OSEK is a trademark of Continental Automotive GmbH.


Fig. 4.18 Access to remote objects using CORBA (figure: client with stub, ORB1, IIOP protocol, ORB2, skeleton with object)


OSEK COM is an example of communication middleware dedicated toward 
embedded systems. In addition to middleware tailored for embedded systems, many 
communication standards developed for non-embedded applications can be adopted 
for embedded systems as well. 


4.6.2 CORBA 


CORBA® (Common Object Request Broker Architecture) [433] is one example 
of such adopted standards. CORBA facilitates access to remote services.
With CORBA, remote objects can be accessed through standardized interfaces.
Clients communicate with local stubs, imitating the access to the remote
objects. These clients send information about the object to be accessed as well as 
parameters (if any) to the Object Request Broker (ORB; see Fig. 4.18). The ORB 
then determines the location of the object to be accessed and sends information 
via a standardized protocol, e.g., the IIOP protocol, to where the object is located. 
This information is then forwarded to the object via a skeleton, and the information 
requested from the object (if any) is returned using the ORB again. 

Standard CORBA does not provide the predictability required for real-time 
applications. Therefore, a separate real-time CORBA (RT-CORBA) standard has 
been defined [428]. An essential feature of RT-CORBA is to provide end-to-end
predictability of timeliness in a fixed priority system. This involves respecting
thread priorities between client and server for resolving resource contention and 
bounding the latencies of operation invocations. One particular problem of real-time 
systems is that thread priorities might not be respected when threads obtain mutually 
exclusive access to resources. The priority inversion problem (see p. 212) has to be 
addressed in RT-CORBA. RT-CORBA includes provisions for bounding the time 
during which such priority inversion can happen. RT-CORBA also includes facilities 
for thread priority management. This priority is independent of the priorities 
of the underlying operating system, even though it is compatible with the real- 
time extensions of the POSIX standard for operating systems [201]. The thread 
priority of clients can be propagated to the server side. Priority management is 
also available for primitives providing mutually exclusive access to resources. The 
priority inheritance protocol just described must be available in implementations of 
RT-CORBA. Pools of pre-existing threads avoid the overhead of thread creation and 
thread construction.


4.6.3 POSIX Threads (Pthreads)


The POSIX thread (Pthread) library is an application programming interface (API) 
to threads at the operating system level [37]. Pthreads are consistent with the IEEE 
POSIX 1003.1c operating system standard. A set of threads can be run in the 
same address space. Therefore, communication can be based on shared memory 
communication. This avoids the memory copy operations typically required for MPI 
(see Sect. 2.8.3 on p. 113). The library is therefore appropriate for programming 
multi-core processors sharing the same address space, and it includes a standard 
API with mechanisms for mutual exclusion. Pthreads use completely explicit 
synchronization [554]. The exact semantics depends on the memory consistency 
model used. Synchronization is hard to program correctly. The library can be 
employed as a back end for other programming models. 
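
The explicit synchronization mentioned above can be illustrated with a small C sketch in which several threads increment a shared counter in the common address space. The mutex protects each update; without it, increments could be lost. The function run_counter_demo and the constants are our own illustrative choices, whereas the pthread_* calls are the standard Pthreads API.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4
#define INCREMENTS  10000

/* Shared state: all threads run in the same address space. */
static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;  /* unused */
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);    /* enter critical section */
        counter++;
        pthread_mutex_unlock(&counter_lock);  /* leave critical section */
    }
    return NULL;
}

/* Starts the workers, waits for them, and returns the final counter
 * value, or -1 if not all threads could be created. */
long run_counter_demo(void)
{
    pthread_t threads[NUM_THREADS];
    int started = 0;

    counter = 0;
    for (; started < NUM_THREADS; started++)
        if (pthread_create(&threads[started], NULL, worker, NULL) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return started == NUM_THREADS ? counter : -1;
}
```

No data is copied between the threads; the price is that every access to the shared counter has to be protected explicitly by the programmer.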


4.6.4 UPnP and DPWS 


Universal Plug and Play (UPnP) is an extension of the plug-and-play concept of PCs 
toward devices connected within a network. Connecting network printers, storage 
space, and switches in homes and offices easily can be seen as the key target [438]. 
Due to security concerns, only data is exchanged. Code cannot be transferred. 

Devices Profile for Web Services (DPWS) aims at being more general than 
UPnP. “The Devices Profile for Web Services (DPWS) defines a minimal set of 
implementation constraints to enable secure Web Service messaging, discovery, 
description, and eventing on resource-constrained devices” [597]. DPWS specifies 
services for discovering devices connected to a network, for exchanging information 
about available services, and for publishing and subscribing to events. 

In addition to libraries designed for high-performance computing (HPC), several 
comprehensive network communication libraries can be used. These are typically 
designed for a loose coupling over Internet-based communication protocols. 

MPI (see p. 113), OpenMP (see p. 114), OSEK/VDX COM, CORBA, Pthreads, 
UPnP, and DPWS are special cases of communication middleware (software to be 
used at a layer between the operating system and applications). Initially, they were 
essentially designed for communication between desktop computers. However, 
there are attempts to leverage the knowledge and techniques also for embedded 
systems. In particular, MPI (Message Passing Interface) is designed for
message-passing-based communication, and it is rather popular. It has recently been
extended to also support shared-memory-based communication.

For mobile devices like smart phones, using standard middleware may be 
appropriate. For systems with hard time constraints (see Definition 1.8 on p. 10), 
their overhead, their real-time capabilities, and their services may be inappropriate.


4.7 Real-Time Databases


Databases provide a convenient and structured way of storing and accessing information.
Accordingly, databases provide an API for writing and reading information.
A sequence of read and write operations is called a transaction. Transactions may
have to be aborted for a variety of reasons: there could be hardware problems,
deadlocks, problems with concurrency control, etc. A frequent requirement is that
transactions do not affect the state of the database unless they have been executed to
their very end. Hence, changes caused by transactions are normally not considered
to be final until they have been committed. Most transactions are required to be
atomic. This means that the end result (the new state of the database) generated by
some transaction must be the same as if the transaction had been either fully completed
or not executed at all. Also, the database state resulting from a transaction must be consistent.
Consistency requirements include, for example, that the values from read requests
belonging to the same transaction are consistent (do not describe a state which never
existed in the environment modeled by the database). Furthermore, no intermediate
state resulting from a partial execution of a transaction may be visible to any other
user of the database (the transactions must be performed as if they were
executed in isolation). Finally, the results of transactions should be persistent. This
property is also called durability of the transactions. Together, these four properties
(atomicity, consistency, isolation, and durability) are known as the ACID properties
(see the book by Krishna and Shin
[310], Chapter 5).
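
Atomicity and isolation can be illustrated with a small C sketch. It is a toy model under our own assumptions, not the design of any real database: the store holds a fixed number of integer values in main memory, only one transaction is active at a time, and durability is not addressed. All names (db, db_begin, db_write, db_commit, db_abort, db_read_committed) are hypothetical.

```c
#include <string.h>

#define DB_SIZE 4   /* assumption: small fixed-size store */

struct db {
    int committed[DB_SIZE];  /* state visible to all other users */
    int working[DB_SIZE];    /* private copy of the running transaction */
    int in_txn;              /* 1 while a transaction is active */
};

/* Starts a transaction by taking a private copy of the committed state. */
void db_begin(struct db *d)
{
    memcpy(d->working, d->committed, sizeof d->working);
    d->in_txn = 1;
}

/* Writes inside the transaction; returns -1 if no transaction is active
 * or the key is out of range. */
int db_write(struct db *d, int key, int value)
{
    if (!d->in_txn || key < 0 || key >= DB_SIZE)
        return -1;
    d->working[key] = value;
    return 0;
}

/* What other users see: only committed values (key must be in range). */
int db_read_committed(const struct db *d, int key)
{
    return d->committed[key];
}

/* Makes all writes of the transaction visible in a single step. */
void db_commit(struct db *d)
{
    if (d->in_txn) {
        memcpy(d->committed, d->working, sizeof d->committed);
        d->in_txn = 0;
    }
}

/* Discards all writes; the committed state is left untouched. */
void db_abort(struct db *d)
{
    d->in_txn = 0;
}
```

A transfer of 30 units from entry 0 to entry 1 would be written as db_begin(), two db_write() calls, and db_commit(). If db_abort() is called instead, neither of the two writes becomes visible, so the committed state never shows a partially executed transfer.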

For some databases, there are soft real-time constraints. For example, time
constraints for airline reservation systems are soft. In contrast, there may also be
hard constraints. For example, automatic recognition of pedestrians in automobile
applications and target recognition in military applications must meet hard real-time
constraints. The above requirements make it very difficult to guarantee hard real-time
constraints. For example, transactions may be aborted several times before
they are finally committed. For all databases relying on demand paging and on hard
disks, the access times to disks are hardly predictable. Possible solutions include
main memory databases and predictable use of flash memory. Embedded databases
are sometimes small enough to make this approach feasible. In other cases, it may
be possible to relax the ACID requirements. For further information, see the book
by Krishna and Shin as well as Lam and Kuo [319].


Table 4.1 Set of jobs requesting exclusive use of resources

Job  Priority  Arrival  Run-time  Printer                  Comm line
                                  $t_{P,P}$   $t_{V,P}$    $t_{P,C}$   $t_{V,C}$
J1   1 (high)  3        4         1           4            -           -
J2   2         10       3         -           -            1           2
J3   3         5        6         -           -            4           6
J4   4 (low)   0        7         2           5            -           -


4.8 Problems 


We suggest solving the following problems either at home or during a flipped 
classroom session: 


4.1 Which requirements must be met for an embedded operating system? 


4.2 Which techniques can be used to customize an embedded operating system in 
the necessary way? 


4.3 Which requirements must be met for a real-time operating system? How do 
they differ from the requirements of a standard OS? Which features of a standard 
OS like Windows or Linux could be missing in an RTOS? 


4.4 How many seconds have been added on New Year’s Eve to compensate for the
differences between UTC and TAI since 1958? You may search the Internet for
an answer to this question.


4.5 Find processors for which memory protection units are available! How are
memory protection units different from the more frequently used memory management
units (MMUs)? You may search the Internet for an answer to this question.


4.6 Describe classes of embedded systems for which protection should definitely 
be provided! Describe classes of systems, for which we would possibly not need 
protection! 


4.7 Provide an example demonstrating priority inversion for a system comprising 
three jobs! 


4.8 Download the levi learning module leviRTS from the levi web site [497]. Model 
a job set as described in Table 4.1. 

$t_{P,P}$ and $t_{P,C}$ are the times relative to the start times at which a job requests
exclusive use of the printer or the communication line, respectively (called Δt P in
levi). $t_{V,P}$ and $t_{V,C}$ are the times relative to the start times at which these resources
are released. Use priority-based, preemptive scheduling! Which problem occurs?
How can it be solved?


4.9 Which resource access protocols prevent deadlocks caused by exclusive access 
to resources? 


4.10 How is the use of the system stack optimized in ERIKA? 


4.11 Which problems have to be solved if Linux is used as an operating system for 
an embedded system? 


4.12 Which impact does the priority inversion problem have on the design of 
network middleware? 


4.13 How could flash memory have an influence on the design of real-time 
databases?


Chapter 5
Evaluation and Validation


During the design procedure, we have to check repeatedly whether or not the 
system under design is likely to perform its function and to satisfy all relevant 
design objectives. This is the purpose of validations and evaluations which must be 
performed during the design process. This chapter starts with a presentation of tech- 
niques for the evaluation of (partial) designs with respect to objectives. In particular, 
we consider (worst case) execution time, quality of results, thermal behavior, and 
dependability as objectives. We provide an introduction into fundamental techniques 
for computing the worst case execution time. Examples of energy models will 
be presented in order to demonstrate the need for an adjustment of the level of 
model details to the particular application at hand. Thermal modeling is reduced 
to the problem of equivalent electrical modeling. With respect to dependability, an 
introduction to statistical models of reliability as well as an introduction to fault 
trees are included. As a means for relating results for the different objectives against 
each other, we introduce the concept of Pareto optimality. This chapter closes with 
hints regarding validation techniques, including simulation, rapid prototyping, and 
formal verification. 


5.1 Introduction 


5.1.1 Scope 


Specification, hardware platforms, and system software provide us with the basic 
ingredients which we need for designing embedded systems. During the design 
process, we must validate and evaluate designs rather frequently. These activities 
can be defined as follows: 


Definition 5.1 Evaluation is the process of computing quantitative information of 
some key characteristics (or “objectives”) of a certain (possibly partial) design.


Fig. 5.1 Context of the current chapter


Definition 5.2 Validation is the process of checking whether or not a certain 
(possibly partial) design is appropriate for its purpose, meets all constraints, and 
will perform as expected. 


Definition 5.3 Validation with mathematical rigor is called (formal) verification. 


Validation and evaluation are required at various phases during the design 
procedure (see Fig. 5.1). Validation and design should be intertwined and not be 
considered as two completely independent activities. Validation and evaluation, 
even though different from each other, are very much linked. Due to their impact, 
we will describe validation and evaluation before we talk about design steps. 


5.1.2 Multi-Objective Optimization 


Design evaluations will, in general, lead to a characterization of the design by 
several criteria, such as execution time, energy consumption, quality of results, 
thermal behavior, and dependability. Merging all these criteria into a single objective 
function (e.g., by using a weighted average) is usually not advisable, as this would 
hide some of the essential characteristics of designs. Rather, it is recommended to 
return to the designer a set of designs among which the designer can then select an 
appropriate design. Such a set should, however, only contain “reasonable” designs. 
Finding such sets of designs is the purpose of multi-objective optimization 
techniques. 

In order to perform multi-objective optimization, we consider an m-dimensional
space $X$ of possible solutions of the optimization problem. These dimensions
could, for example, reflect the number of processors, the sizes of memories, as well
as the number and types of buses. For this space $X$, we define an n-dimensional
function


$$f(x) = (f_1(x), \ldots, f_n(x)) \quad \text{where } x \in X$$


which evaluates designs with respect to several criteria or objectives (e.g., cost and
performance). Let $F$ be the n-dimensional space of values of these objectives (the
so-called objective space). Suppose that, for each of the objectives, some total order
$<$ and the corresponding $\le$ order are defined. In the following, we assume that the
goal is to minimize our objectives.


Fig. 5.2 Pareto optimality: left, Pareto point; right, Pareto front


Definition 5.4 Vector $u = (u_1, \ldots, u_n) \in F$ dominates vector $v = (v_1, \ldots, v_n) \in F$
iff $u$ is “better” than $v$ with respect to at least one objective and not worse than $v$
with respect to all other objectives:


$$\forall i \in \{1, \ldots, n\}: u_i \le v_i \;\wedge \tag{5.1}$$
$$\exists j \in \{1, \ldots, n\}: u_j < v_j \tag{5.2}$$


Definition 5.5 Vector $u \in F$ is called indifferent with respect to vector $v \in F$ iff
neither $u$ dominates $v$ nor $v$ dominates $u$.


Definition 5.6 A design $x \in X$ is called Pareto optimal with respect to $X$ iff there
is no design $y \in X$ such that $u = f(x)$ is dominated by $v = f(y)$.


The previous definition defines Pareto optimality in the solution space. The next
definition serves the same purpose in the objective space.


Definition 5.7 Let $S \subseteq F$ be a subset of vectors in the objective space. $v \in F$ is
called a non-dominated solution with respect to $S$ iff $v$ is not dominated by any
element of $S$. $v$ is called Pareto optimal iff $v$ is non-dominated with respect to all
solutions in $F$.


Figure 5.2 highlights the different areas in an objective space with objectives $O_1$
and $O_2$, relative to design point (1).

Fig. 5.3 Kiviat diagram: top (red), mid-range (green, dashed), and entry-level (blue) models


The upper right area corresponds to designs that would be dominated by design
(1), since they would be “worse” with respect to both objectives. Designs in the
lower left rectangle would dominate design (1), since they would be “better” with
respect to both objectives. Designs in the upper left and the lower right area are
indifferent: they are “better” with respect to one objective and “worse” with respect
to the other. Figure 5.2 (right) shows a set of Pareto points, i.e., the so-called Pareto
front.


Definition 5.8 Design space exploration (DSE) based on Pareto points is the 
process of finding and returning a set of Pareto optimal solutions to the designer, 
enabling the designer to select the most appropriate implementation. 
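
The definitions above can be illustrated with a short Python sketch. It assumes that all objectives are minimized and that each design is already represented by its vector in the objective space; the function names and the sample design points are hypothetical and not taken from Fig. 5.2.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """True iff u dominates v (Definition 5.4, all objectives minimized)."""
    not_worse = all(ui <= vi for ui, vi in zip(u, v))
    strictly_better = any(ui < vi for ui, vi in zip(u, v))
    return not_worse and strictly_better


def indifferent(u: Sequence[float], v: Sequence[float]) -> bool:
    """True iff neither vector dominates the other (Definition 5.5)."""
    return not dominates(u, v) and not dominates(v, u)


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return all points not dominated by any other point of the set."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical designs, each described by two objectives (O1, O2).
designs = [(1, 9), (3, 4), (4, 6), (6, 2), (7, 7)]
print(pareto_front(designs))  # [(1, 9), (3, 4), (6, 2)]
```

The exhaustive pairwise comparison is quadratic in the number of designs, which is adequate for small design spaces; realistic design space exploration usually relies on heuristics, since the set of candidate designs cannot be enumerated.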


In order to visualize objectives in multiple dimensions, so-called radar charts, 
spider charts, or Kiviat diagrams can be used [579]. They are extensions of the type 
of diagram which we have used in Fig. 2.74 to multiple dimensions. 


Example 5.1 As shown in Fig. 5.3, we can compare several designs (e.g., of mobile 
phones) according to objectives similar to the ones presented in the next subsection. 

Minimization of all objectives is assumed. The top model minimizes most
objectives, except for costs. For the entry-level model, it is the other way around. ∇


5.1.3 Relevant Objectives 


For servers and PCs, the average performance plays a dominating role. For
embedded and cyber-physical systems, multiple objectives need to be considered.
The following list explains if and where each of these objectives is discussed in this book:

1. Average performance: Some comments on this objective will be made in
Sect. 5.2. This objective is frequently computed from simulations, which will 
be introduced in Sect. 5.7. 

2. Worst case performance/real-time behavior: Some fundamental techniques 
for computing the worst case execution time (WCET) will be presented in 
Sect. 5.2.2. This will be complemented by an introduction to real-time calculus 
in Sect. 5.2.3. 

3. Quality metrics: Quality metrics will be presented in Sect. 5.3. In addition,
transformations between number systems are discussed in Sect. 7.1.5. 

4. Energy/power consumption: A brief overview of techniques for evaluating this 
objective will be presented in Sect. 5.4. 

5. Thermal models: An introduction to this topic will be presented in Sect. 5.5. 

6. Dependability: Dependability is the topic of Sect. 5.6, with subsections on
safety, security, and reliability. 

7. Electromagnetic compatibility: This objective will not be considered here. 

8. Testability: Costs for testing systems can be very large, sometimes larger even 
than production costs. Hence, testability should be considered as well, preferably 
already during the design. Testability will be discussed in Chap. 8. 

9. Cost: Cost in terms of silicon area or real money will not be considered here. 

10. Weight, robustness, usability, extendability, and environmental friendliness: 
These objectives will also not be considered. 


There are more objectives than the ones listed above. For example, we could use
standards for the evaluation of software quality, such as ISO/IEC 25022 [258],
ISO/IEC 25023 [259], and ISO/IEC 25024 [257]. The next section presents some
approaches for performance evaluation, focusing on the worst case performance.


5.2 Performance Evaluation 


Performance evaluation aims at predicting the performance of systems. This is a 
major challenge (especially for cyber-physical systems) since we might need worst 
case information, rather than just average case information. Such information is 
necessary in order to guarantee real-time constraints. 


5.2.1 Early Phases 


Two different classes of techniques have been proposed for obtaining performance 
information already during early design phases: 


• Estimated cost and performance values: Quite a number of estimators have
been developed for this purpose. Examples include the work by Jha and Dutt
[274] for hardware and Jain et al. [266] and Franke [167] for software. Generating
sufficiently precise estimates requires considerable effort.

• Accurate cost and performance values: We can also use the real binary software
code on a close-to-real hardware platform. This is only possible if interfaces
to compilers exist. This method can be more precise than the previous one but
may be significantly (and sometimes prohibitively) more time-consuming.


In order to obtain sufficiently precise information, communication needs to be 
considered as well. Unfortunately, it is typically difficult to compute communication 
cost already during early design phases. 

Formal performance evaluation techniques have been proposed by many 
researchers. For embedded systems, the work of Thiele et al., Henia and Ernst et al., 
and Wilhelm et al. is particularly relevant (see, e.g., [210, 536] and [587]). These 
techniques require some knowledge of architectures. They are less appropriate for 
early design phases, but some of them can be used without knowing all details about 
target architectures. These approaches model real, physical time. 


5.2.2 WCET Estimation


Scheduling of tasks requires knowledge about the duration of task executions, 
especially if meeting time constraints has to be guaranteed, as in real-time (RT) 
systems. The worst case execution time (WCET) is the basis for most scheduling 
algorithms. Some definitions related to the WCET are shown in Fig. 5.4. 


Definition 5.9 The worst case execution time (WCET) is the largest execution 
time of a program for any input and any initial execution state. 


Unfortunately, the WCET is extremely difficult to compute. In general, it is 
undecidable whether or not the WCET is finite. This is obvious from the fact 
that it is undecidable whether or not a program terminates. Hence, the WCET 
can only be computed for certain programs/tasks. For example, for programs 
without recursion, without while loops, and with loops having statically known 
iteration counts, decidability is not an issue. But even with such restrictions, it is 
usually practically impossible to compute the WCET exactly. The effects of modern
processor architectures’ pipelines with their different kinds of hazards and of memory
hierarchies with limited predictability of hit rates are difficult to predict precisely
at design time. Computing the WCET for systems containing interrupts, virtual
memory, and multiple processors is an even greater challenge. As a result, we must
be happy if we are able to compute good upper bounds on the WCET.

Such upper bounds are usually called estimated worst case execution times, or
$WCET_{EST}$. Such bounds should have at least two properties:


1. The bounds should be safe ($WCET_{EST} \geq WCET$).
2. The bounds should be tight ($WCET_{EST} - WCET \ll WCET$).


Note that the term “estimated” does not mean that the resulting times are unsafe. 

Sometimes, architectural features which reduce the average execution time but
cannot guarantee to reduce $WCET_{EST}$ are completely omitted from the real-time
designs (see p. 154). Computing tight upper bounds on the execution time may still
be difficult. The architectural features described above also present problems for the
computation of $WCET_{EST}$. The computation of such bounds is extremely difficult
for multi-cores. In fact, potential conflicts might even cause multi-cores to have
larger worst case bounds than the corresponding single cores.


Definition 5.10 The best-case execution time (BCET) is the smallest execution
time of a program, considering all feasible inputs and initial states. The $BCET_{EST}$
is a safe and tight lower bound on the execution time.


Computing tight bounds from a program written in a high-level language such 
as C without any knowledge of the generated assembly code and the underlying 
architectural platform is impossible. Therefore, a safe analysis must start from real 
machine code. Any other approach would lead to unsafe results. 

We will study WCET estimation more closely, using a description of the tool aiT 
by R. Wilhelm [587]. The architecture of aiT is shown in Fig. 5.5. 

Consistent with our remark about the problems with high-level code, aiT starts 
from an executable object file comprising the code to be analyzed. From this code, a 
control flow graph (CFG) is extracted. Next, loop transformations are applied. These 
include transformations between loops and recursive function calls as well as virtual 
loop unrolling. This unrolling is called “virtual” since it is performed internally, 
without actually modifying the code to be executed. Results are represented in 
the CRL (control flow representation language) format. The next phase employs 
different static analyses. Static analyses read the AIP-file comprising designer’s 
annotations. These annotations contain information which is difficult or impossible 
to extract automatically from the program (e.g., bounds of complex loops). Static 
analyses include value analysis, cache analysis, and pipeline analyses. 

A value analysis computes enclosing intervals for possible values in registers
and local variables. The resulting information can be used for control flow analysis
and for data cache analysis. Frequently, values such as addresses are precisely
known (especially for “clean” code), and this helps in predicting accesses to
memories.

Fig. 5.5 Architecture of the aiT timing analysis tool


The next step is cache and pipeline analysis. We will present a few details about
the cache analysis. Suppose that we are using an n-way set associative cache (see Fig. 5.6).¹
We consider that part of the cache (the row) corresponding to a certain index
(shown in bold and blue in Fig. 5.6). We assume that eviction from the row is
controlled by the least recently used (LRU) strategy.² This means that among all
references for a particular index, the last n referenced memory blocks are stored in
the row. We assume that the necessary LRU management hardware is available for
each index and that each index is handled independently of other indexes. Under
this assumption, all evictions for a particular index are completely independent
of decisions for other indexes. This independence is extremely important, since it
allows us to consider each of the indexes independently.

Let us now consider a row and a particular index. Suppose that we have 
information about potential entries for each of the cache ways (columns). What 
will happen in case of an access to a particular index? First of all, let us consider 
the case of an access to a variable e known to be in the cache. After that access, that 
variable is known to be the youngest (see Fig. 5.7). Entries on the left are assumed 
to be younger than the ones on the right. 

Now, assume that we have an access to some variable (say c) which is not yet in 
the cache. This access will remove the oldest entry from the cache (see Fig. 5.8). 
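
The two cases above (hit and miss in one row of an LRU cache) can be sketched in a few lines of Python. The row is modeled as a list ordered from youngest (left) to oldest (right); the function name and the initial row contents are hypothetical and only chosen to resemble the situations named in the captions of Figs. 5.7 and 5.8.

```python
def lru_access(row, block, ways=4):
    """Return the row after an access; index 0 holds the youngest entry."""
    # On a hit, the block is moved to the front instead of being duplicated.
    rest = [b for b in row if b != block]
    # On a miss in a full row, the slice drops the oldest entry.
    return ([block] + rest)[:ways]


row = ["a", "b", "e", "f"]      # hypothetical content of one cache row
row = lru_access(row, "e")      # hit: e becomes the youngest entry
print(row)                      # ['e', 'a', 'b', 'f']
row = lru_access(row, "c")      # miss: c enters, the oldest entry f is evicted
print(row)                      # ['c', 'e', 'a', 'b']
```

Because each index is handled independently, one such list per index is sufficient to model the complete cache under the LRU assumption.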


1 We assume that students are familiar with concepts of caches.
2 Unfortunately, this strategy is typically not available for processors.

Fig. 5.6 Set associative cache (for n = 4)
Fig. 5.7 Access to variable e makes it the youngest
Fig. 5.8 Access to variable c causes eviction of f


Furthermore, consider control flow joins. What do we know about the content of 
the partial cache after the join? 

We must distinguish between may- and must-information and the corresponding 
analysis. Must-analysis reveals the entries which must be in the cache. This 
information is useful for computing the WCET. May-analysis identifies the entries 
which may be in the cache. This information is typically used to conclude that 
certain information will definitely not be in the cache. This knowledge is then 
exploited during the computation of the BCET. 

As an example of must- and may-analysis, we consider must information at 
control flow joins. Figure 5.9 shows the corresponding situation. In Fig. 5.9, 
memory object c is assumed to be the youngest object for one path to the join and 
a is assumed to be the youngest object for the other path. The age of the other 
entries is defined accordingly. What do we know about the “worst” case after the 
join? A certain entry is guaranteed to be in the cache only if it is guaranteed to 
be in the cache for both paths. This means that the intersection of the memory 
objects defines the result of the must-analysis after the join. As a worst case, we 
must assume the maximum of the ages along the two paths. Figure 5.9 shows the 
result. This analysis uses sets of entries for each cache way. 

Now, consider may-analysis for control flow joins. Figure 5.10 depicts the
situation. Some object being in the cache on either of the two paths to the join may
be in the cache after the join. Hence, the set of objects which may be in the cache
after the join consists of the union of the objects that were in the cache before the
join. As a best case, we use the minimum of the ages before the join. Figure 5.10
shows the result.

Fig. 5.9 Must-analysis at program joins for LRU caches (intersection + maximum age). Abstract cache rows, youngest entry on the left: path 1: {c} | {e} | {a} | {d}; path 2: {a} | {} | {c,f} | {d}; after the join: {} | {} | {a,c} | {d}

Fig. 5.10 May-analysis at program joins for LRU caches
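
Both join rules can be written down compactly. The following Python sketch represents an abstract cache row as a list of sets, where the list index is the age (0 = youngest). The function name and the two sample states are assumptions made for illustration; the states are chosen such that c is the youngest object on one path and a is the youngest on the other, as in the description of the must-analysis.

```python
def join(state_a, state_b, kind):
    """Join two abstract LRU rows (lists of sets, index = age) at a control flow join."""
    age_a = {blk: age for age, blks in enumerate(state_a) for blk in blks}
    age_b = {blk: age for age, blks in enumerate(state_b) for blk in blks}
    if kind == "must":
        blocks = age_a.keys() & age_b.keys()   # intersection of both paths
        pick = max                             # worst case: maximum age
    elif kind == "may":
        blocks = age_a.keys() | age_b.keys()   # union of both paths
        pick = min                             # best case: minimum age
    else:
        raise ValueError("kind must be 'must' or 'may'")
    result = [set() for _ in state_a]
    for blk in blocks:
        ages = [m[blk] for m in (age_a, age_b) if blk in m]
        result[pick(ages)].add(blk)
    return result


path_1 = [{"c"}, {"e"}, {"a"}, {"d"}]
path_2 = [{"a"}, set(), {"c", "f"}, {"d"}]

must = join(path_1, path_2, "must")  # equals [set(), set(), {"a", "c"}, {"d"}]
may = join(path_1, path_2, "may")    # equals [{"a", "c"}, {"e"}, {"f"}, {"d"}]
```

In the must result, e and f disappear because each of them is present on one path only, whereas the may result keeps both of them at their respective ages.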

Static analyses also comprise pipeline analysis. Pipeline analysis has to compute 
safe bounds on the number of cycles required to execute code in the machine 
pipeline. Details of pipeline analysis are explained by Hahn et al. [196] and Thesing 
[534]. The result of static analyses consists of bounds on the execution times for 
each of the basic blocks of a program. Results are written to the PER-file shown in 
Fig. 5.5. 

aiT’s next phase exploits these bounds to derive $WCET_{EST}$ values for the
entire program, using an integer linear programming (ILP) model (see p. 393),
comprising two types of information:


• The objective function: In our application of ILP modeling, this function
represents the overall execution time. This time is calculated as


$WCET_{EST} = \sum_{i \in \text{basic blocks}} e_i \cdot f_i$  (5.3)


where $e_i$ is the worst case execution time of basic block i (as computed during
static analysis) and $f_i$ its worst case execution count. Only some of these counts
can be determined automatically, and additional designer-provided information,
e.g., about loop bounds, may be required.

• Linear constraints: These reflect the structure of the control flow graph.

Fig. 5.11 Sample program: left: extended control flow graph; right: $WCET_{EST}$ of basic blocks


Example 5.2 Let us consider the simple code shown next: 


int main() { int i,j=0;
  _Pragma("loopbound min 100 max 100")  /* hint for bound analysis */
  for (i=0; i<100; i++) {
    if (i<50) j+=i;
    else j+=(i*13) % 42;
  }
  return j;
}


Figure 5.11 (left) shows the control flow graph (CFG) corresponding to this small 
program. This graph is extended by additional start and exit nodes. Node _L1 reflects 
the for-testing, _L3 the if-testing, _L4 and _L5 the two cases of the if-statement, 
and _L6 its join operation. Variables x0 to x20 denote the number of executions 
of the blocks and the number of transitions between blocks. For example, we are 
transitioning from node main into node _L1 x6 times and are executing the target 
node x7 times. We assume that the analysis of the WCET for each of the basic 
blocks has resulted in the list shown on the right of Fig. 5.11. The following is a
partial list of the ILP constraints: 


01: 21 x2 + 27 x7 + 2 x11 + 2 x14 + 20 x16 + 13 x18 + 20 x19;  /*objective*/
02: x7 - x8 - x6 = 0;     /* Constraint for flow entering CFG node _L1 */
03: x7 - x9 - x10 = 0;    /* Constraint for flow leaving CFG node _L1 */
04: x7 - 101 x9 >= 0;     /* Constraint for lower loop bound of _L1 */
05: x7 - 101 x9 <= 0;     /* Constraint for upper loop bound of _L1 */
06: x0 - x4 = 0;          /* CFG Start Constraint */
07: x2 - x4 = 0;          /* Constraint for flow entering function _main */
08: x2 - x6 = 0;          /* Constraint for flow leaving CFG node _main */
09: ...


Line 01 contains the cost function. All other lines model constraints reflecting the
structure of the graph. Consider, for example, node _L1. Constraints for this node
are shown in lines 02 and 03. The number of times that we are branching into the
node (x6+x8) is equal to its number of executions (x7). The number of times that
we are leaving from the node (x9+x10) is also equal to its number of executions.
Lines 04 and 05 reflect the number of loop iterations. This number is taken from
the pragma in the code. Line 06 describes the fact that node start is executed exactly
as many times as we are branching into the code. The other lines are reflecting the
structure in a similar way. ∇


The ILP problem can be solved with some standard ILP solver. Maximizing the 
objective function yields a safe upper bound on the WCET. 

This technique for modeling execution time is called implicit path enumeration 
(IPET) [343], since the problem of enumerating the potentially large number of 
execution paths is avoided. 
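
The idea of IPET can be illustrated without an ILP solver for a very small model. The following Python sketch is not aiT's implementation: it uses a hypothetical control flow graph (a loop with bound 100 containing an if-statement), hypothetical per-block cycle bounds that are not the values of Fig. 5.11, and it replaces the ILP solver by a brute-force search over the single execution count left free by the flow constraints.

```python
# Hypothetical WCET bounds e_i of the basic blocks (in cycles).
COST = {"entry": 5, "header": 3, "test": 2, "then": 4, "else": 10, "join": 1, "exit": 2}
LOOP_BOUND = 100  # assumed to come from a loop bound annotation


def ipet_bound(cost, loop_bound):
    """Maximize sum(e_i * f_i) over all execution counts satisfying the flow constraints."""
    best_total, best_counts = -1, None
    # Flow conservation leaves one free variable: how often "then" is executed.
    for f_then in range(loop_bound + 1):
        counts = {
            "entry": 1,
            "header": loop_bound + 1,        # loop test runs once more than the body
            "test": loop_bound,
            "then": f_then,
            "else": loop_bound - f_then,     # test = then + else
            "join": loop_bound,
            "exit": 1,
        }
        total = sum(cost[b] * counts[b] for b in cost)
        if total > best_total:
            best_total, best_counts = total, counts
    return best_total, best_counts


bound, counts = ipet_bound(COST, LOOP_BOUND)
print(bound, counts["then"], counts["else"])  # 1610 0 100
```

Since the model contains no information on how often each branch can actually be taken, the maximization charges the more expensive branch for every iteration. The result is safe, but it becomes tighter only if additional flow constraints are supplied.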

aiT visualizes the results as annotated control flow graphs. The designer could
optimize the system under design by exploiting these graphs. Due to the presented
approach, aiT has limitations: preemption by other processes, interrupts,
input/output, and direct memory transfers (DMA) are not supported.

Only a few approaches exist for the WCET analysis of multi-cores [264, 265, 286].
New probabilistic approaches [2] aim at complementing available methods. They 
are usually based on extreme value theory [123]. 


5.2.3 Real-Time Calculus 


WCET estimates allow us to predict the execution time of some algorithm for a single
input event. However, the overall goal is more comprehensive. Overall, we should 
make sure that our hardware platform is capable of processing streams of events 
in a timely manner (which may be important for some parts of the Internet of 
Things). 

This can be checked with Thiele’s real-time calculus (RTC). This calculus
is based on the description of the rate of incoming events.³ This description
also includes fluctuations of this rate. Toward this end, the timing characteristics of
a sequence (or stream) of events are represented by a tuple of arrival curves:


$\bar{\alpha}^u(\Delta), \bar{\alpha}^l(\Delta) \in \mathbb{R}_{\geq 0}, \quad \Delta \in \mathbb{R}_{\geq 0}$


These curves represent the maximal resp. the minimal number of events arriving
within a time interval of length $\Delta$. There are at most $\bar{\alpha}^u(\Delta)$ and at least
$\bar{\alpha}^l(\Delta)$ events arriving within the time interval $[t, t + \Delta)$ for all $t \geq 0$.
Figure 5.12 shows the number of possibly arriving events for some possible models
of arriving events. For example, in the case of periodic event streams with period T,
there is a maximum of a single event for interval lengths $\Delta \in (0, T)$.⁴ Similarly,
there is an upper bound of two events for $\Delta \in (T, 2T)$. Now, let us consider the
lower bound for $\Delta \in (0, T)$. There is possibly not a single event in such an interval.
Hence, the bound is zero. For $\Delta \in (T, 2T)$, there has to be at least one
event. Therefore, the bound is one. So, for $\Delta = 0.5\,T$, there will be at least zero
and at most one incoming event (see Fig. 5.12 (left)). In the case of periodic event
streams with jitter J, these curves are shifted by this amount (see Fig. 5.12 (right)).
The upper bound is shifted to the left; the lower bound is shifted to the right. The
jitter is assumed not to be accumulating.

We are using bars on top of symbols (like $\bar{\alpha}$) for all entities referring to incoming
events.

3 Our presentation of the real-time calculus is based on Thiele’s presentation in the book edited
by Zurawski [536]. Resulting considerations at the system level have been called modular
performance analysis (MPA).

Fig. 5.12 Arrival curves: left: periodic stream; right: periodic stream with jitter J
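
For periodic streams, arrival curves can be written in closed form. The following Python sketch uses the commonly quoted staircase formulas with ceiling and floor; these formulas, the function names, and the convention that the upper curve is zero for an interval of length zero are assumptions of this sketch and are not stated explicitly in the text. Discontinuities at multiples of the period are resolved by whatever the ceiling and floor functions yield.

```python
import math


def alpha_upper(delta, period, jitter=0):
    """Upper arrival curve: maximum number of events in any window of length delta."""
    if delta <= 0:
        return 0
    return math.ceil((delta + jitter) / period)   # jitter shifts the curve to the left


def alpha_lower(delta, period, jitter=0):
    """Lower arrival curve: minimum number of events in any window of length delta."""
    return max(0, math.floor((delta - jitter) / period))  # shifted to the right


T = 10  # hypothetical period
print(alpha_lower(0.5 * T, T), alpha_upper(0.5 * T, T))   # 0 1
print(alpha_lower(1.5 * T, T), alpha_upper(1.5 * T, T))   # 1 2
print(alpha_lower(11, T, jitter=2), alpha_upper(9, T, jitter=2))  # 0 2
```

The first two lines of output reproduce the bounds discussed for the periodic stream without jitter; the last line shows how a jitter of two time units widens the gap between the lower and the upper curve.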

Available computational and communication service capacity can be described 
by service functions: 


$\beta^u(\Delta), \beta^l(\Delta) \in \mathbb{R}_{\geq 0}, \quad \Delta \in \mathbb{R}_{\geq 0}$


These functions allow us to model situations in which the available service capacity 
is fluctuating. Figure 5.13 shows the communication capacity of some time division 
multiple access (TDMA) bus (see p. 176). Allocation is done periodically with a 
period of T. Bus arbitration allocates this bus during a time window s time units 
long. During this window, the bus achieves a bandwidth of b. The upper bound is 
obtained if the bus is allocated exactly at the time we are starting our observation. 
The transferred amount is then increasing linearly. The lower bound is obtained if 
the bus was just deallocated when we started our observation of length $\Delta$. Then we
must wait $T - s$ time units until the bus gets allocated again.
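
The two bounds for the TDMA bus follow directly from this description. In the Python sketch below, the window of length delta is split into full periods and a remainder; the function and parameter names as well as the numeric values are hypothetical.

```python
def tdma_upper(delta, period, slot, bandwidth):
    """Best case: the slot starts exactly when the observation starts."""
    full, rest = divmod(delta, period)
    return bandwidth * (full * slot + min(rest, slot))


def tdma_lower(delta, period, slot, bandwidth):
    """Worst case: the slot has just ended, so we first wait for period - slot."""
    full, rest = divmod(delta, period)
    return bandwidth * (full * slot + max(0, rest - (period - slot)))


T, s, b = 10, 3, 2  # hypothetical period, slot length, and bandwidth
print([tdma_upper(d, T, s, b) for d in (3, 10, 12)])  # [6, 6, 10]
print([tdma_lower(d, T, s, b) for d in (7, 9, 10)])   # [0, 4, 6]
```

Both curves grow by `bandwidth * slot` per period; they differ only in where the idle time of length `period - slot` is placed within the observation window.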

Separate methods are required to determine $\bar{\alpha}$ and $\beta$ for streams of (“external”)
events arriving at the system to be modeled. Their computation is not part of RTC. In
contrast, bounds for events generated within the system are derived by the calculus
(see below).


4 We leave out the subtle discussion of discontinuities at $\Delta = n \cdot T$.

Fig. 5.13 Service functions for a TDMA bus


Up to now, there is no information about the workload required by each
of the incoming events. This workload is represented by additional functions
$\gamma^u(e), \gamma^l(e) \in \mathbb{R}_{\geq 0}$ for each sequence e of incoming events. This information
can be derived from bounds on the execution time of code required for each of the
events. Figure 5.14 shows an example of such functions. This example is based on
the assumption that between three and four time units are required for processing
a single event. Accordingly, the workload for a single event varies between three
and four time units, the workload for two events varies between six and eight time
units, etc. The dashed lines are not part of the function, since it is defined only for
an integer number of events. The workload resulting from an incoming stream of
events can now be easily computed. Upper and lower bounds are characterized by
the functions


$\alpha^u(\Delta) = \gamma^u(\bar{\alpha}^u(\Delta))$ and  (5.4)
$\alpha^l(\Delta) = \gamma^l(\bar{\alpha}^l(\Delta))$  (5.5)


There should be enough computational or communication capacity to handle 
this workload. The number of events which can be processed with the available 
computational capacity can be computed as 


$\bar{\beta}^u(\Delta) = (\gamma^l)^{-1}(\beta^u(\Delta))$ and  (5.6)
$\bar{\beta}^l(\Delta) = (\gamma^u)^{-1}(\beta^l(\Delta))$  (5.7)


Equations (5.6) and (5.7) use the inverse of functions $\gamma^u$ and $\gamma^l$ to convert bounds
on the available capacity (measured in real time units) into bounds measured in
terms of the number of events that can be processed.
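
For the example with three to four time units per event, the conversions between events and time units can be sketched as follows. The function names are hypothetical, and rounding down in the inverse functions is an assumption of this sketch, motivated by the fact that only an integer number of events can be processed.

```python
def gamma_u(events):
    """Upper workload bound: at most 4 time units per event."""
    return 4 * events


def gamma_l(events):
    """Lower workload bound: at least 3 time units per event."""
    return 3 * events


def gamma_u_inv(capacity):
    """Number of events that fit into the capacity even if each one needs 4 units."""
    return capacity // 4


def gamma_l_inv(capacity):
    """Number of events that fit into the capacity if each one needs only 3 units."""
    return capacity // 3


# Workload caused by at most 2 and at least 1 arriving events, cf. Eqs. (5.4) and (5.5):
print(gamma_u(2), gamma_l(1))            # 8 3
# Events that can be processed with 12 time units of capacity, cf. Eqs. (5.6) and (5.7):
print(gamma_l_inv(12), gamma_u_inv(12))  # 4 3
```

The upper bound on the number of processed events is obtained from the lower workload bound and vice versa, which mirrors the crossed use of the two inverse functions in Eqs. (5.6) and (5.7).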

Based on this information, it is possible to derive the properties of outgoing
streams of events from incoming streams of events. Suppose the incoming stream
is characterized by bounds $[\bar{\alpha}^l, \bar{\alpha}^u]$. We can then compute characteristics of the
outgoing streams such as the corresponding bounds $[\bar{\alpha}^{l\prime}, \bar{\alpha}^{u\prime}]$ of the outgoing
stream of events and the remaining service capacity, available for other tasks. This
remaining capacity is derived by transforming service curves $[\bar{\beta}^l, \bar{\beta}^u]$ into service
curves $[\bar{\beta}^{l\prime}, \bar{\beta}^{u\prime}]$ (see Fig. 5.15). This remaining service capacity can be employed
for lower-priority tasks to be executed on the same processor.

Fig. 5.15 Transformation of event stream and service capacities by real-time components


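According to Thiele et al., outgoing streams and remaining service capacities are
bounded by the following functions [536]:


$\bar{\alpha}^{u\prime} = [(\bar{\alpha}^u \otimes \bar{\beta}^u) \oslash \bar{\beta}^l] \wedge \bar{\beta}^u$  (5.8)
$\bar{\alpha}^{l\prime} = [(\bar{\alpha}^l \oslash \bar{\beta}^u) \otimes \bar{\beta}^l] \wedge \bar{\beta}^l$  (5.9)
$\bar{\beta}^{u\prime} = (\bar{\beta}^u - \bar{\alpha}^l) \,\overline{\oslash}\, 0$  (5.10)
$\bar{\beta}^{l\prime} = (\bar{\beta}^l - \bar{\alpha}^u) \,\overline{\otimes}\, 0$  (5.11)


Operators used in these equations are defined as follows:


$(f \otimes g)(t) = \inf_{0 \leq u \leq t} \{ f(t-u) + g(u) \}$  (5.12)
$(f \,\overline{\otimes}\, g)(t) = \sup_{0 \leq u \leq t} \{ f(t-u) + g(u) \}$  (5.13)
$(f \oslash g)(t) = \sup_{u \geq 0} \{ f(t+u) - g(u) \}$  (5.14)
$(f \,\overline{\oslash}\, g)(t) = \inf_{u \geq 0} \{ f(t+u) - g(u) \}$  (5.15)


$\wedge$ denotes the minimum operator.

In essence, these equations characterize outgoing streams and capacities. These
equations have been adopted from communications theory. Proofs regarding these
equations are provided by Network Calculus [327]. The easiest way of using these
equations is to download a MATLAB® toolbox [561].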

The same theory also allows us to compute the delay caused by the real-time
components as well as the size of the buffer required to temporarily store
incoming/outgoing events. This way, performance and other characteristics of the system
can be computed from information about the components.
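
To get a feeling for the four operators of Eqs. (5.12) to (5.15), they can be evaluated on curves sampled at integer points. The following Python sketch is a simplification and not a substitute for the toolbox mentioned above: curves are finite lists f[0..N], so the supremum and infimum over all u ≥ 0 are truncated at the end of the list, and values close to the horizon N may therefore be inexact. Function names and sample curves are hypothetical.

```python
def min_plus_conv(f, g):
    """(f ⊗ g)(t): minimum over 0 <= u <= t of f(t-u) + g(u)."""
    return [min(f[t - u] + g[u] for u in range(t + 1)) for t in range(len(f))]


def max_plus_conv(f, g):
    """Overlined ⊗: maximum over 0 <= u <= t of f(t-u) + g(u)."""
    return [max(f[t - u] + g[u] for u in range(t + 1)) for t in range(len(f))]


def min_plus_deconv(f, g):
    """(f ⊘ g)(t): maximum over u >= 0 of f(t+u) - g(u), truncated at the horizon."""
    n = len(f)
    return [max(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]


def max_plus_deconv(f, g):
    """Overlined ⊘: minimum over u >= 0 of f(t+u) - g(u), truncated at the horizon."""
    n = len(f)
    return [min(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]


f = [0, 3, 4, 5]   # hypothetical curve with an initial burst
g = [0, 2, 4, 6]   # hypothetical curve with constant rate 2

print(min_plus_conv(f, g))    # [0, 2, 4, 5]
print(max_plus_conv(f, g))    # [0, 3, 5, 7]
print(min_plus_deconv(f, g))  # [1, 3, 4, 5]
print(max_plus_deconv(g, f))  # [-1, 1, 3, 6]
```

Each operator needs a number of steps quadratic in the number of samples. For this pair of curves, the min-plus convolution coincides with the pointwise minimum of f and g.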

A second performance analysis method has been proposed by Henia and Ernst 
et al. In this so-called SymTA/S approach [210], the different curves in Thiele’s 
approach are replaced by standard models of event streams such as periodic event 
streams, periodic event streams with random jitter, and periodic event streams with 
bursts. SymTA/S explicitly supports the combination and integration of different 
kinds of analysis techniques known from real-time research. 


5.3 Quality Metrics 


5.3.1 Approximate Computing 


Sometimes, computing the best possible output of some algorithm requires a 
significant amount of resources (in terms of computing time, energy, thermal 
headroom, etc.). For some applications, the best possible output is not actually 
needed, since minor degradations will possibly not even be recognized by users. 
This can be exploited in a resource-constrained environment in order to trade off 
the quality of the output against needed resources. A certain deviation of the actual 
output from the best possible output is accepted, for example, for lossy audio, video, 
and image encoding. This leads us to consider approximate computing. 


Definition 5.11 Computing which tolerates a certain deviation of generated output 
of some algorithm from the best possible result is called approximate computing 
[397]. 


With approximate computing, it is necessary to consider the quality of the 
generated output as one of the objectives. Unfortunately, it is not easy to evaluate 
the quality of some generated result, and several metrics can be used. 


5.3.2 Simple Criteria of Quality 


Some simple metrics can be applied whenever the true (or the best possible) output
is known. Suppose that $x_1, \dots, x_n$ are n samples of some signal x in discrete time.
Furthermore, suppose that instead of the real (or the best possible) values $x_1, \dots, x_n$
we measure or compute approximate values $y_1, \dots, y_n$.
Then, our first metric, the mean-squared error (MSE), is defined as follows:


Definition 5.12 The mean-squared error (MSE) is defined as

$MSE(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2$  (5.16)


The second metric is the root-mean-squared error. 


Definition 5.13 The root-mean-squared error (RMSE) is defined as


$RMSE(x, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2}$  (5.17)


RMSE has the same dimension as the difference between the measured and the real
value, but it should not be confused with the “average error” which is defined next:


Definition 5.14 The mean absolute error (MAE) is defined as

$MAE(x, y) = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|$  (5.18)


If all samples of the measured signal y deviate from the real values x by the same
absolute amount, the MAE is equal to the RMSE. However, the RMSE emphasizes large
deviations between real and measured values (so-called outliers).

The signal-to-noise ratio (SNR) was already defined on p. 142. Next, we define
the peak signal-to-noise ratio, which is similar to the SNR. Let x be a signal, $x_{max}$
its maximum, and y its noisy approximation.


Definition 5.15 The peak signal-to-noise ratio (PSNR) is defined as


$PSNR(x, y) = 10 \log_{10} \left( \frac{x_{max}^2}{MSE(x, y)} \right)$  (5.19)
$= 20 \log_{10} \left( \frac{x_{max}}{RMSE(x, y)} \right)$  (5.20)


The PSNR, just like the SNR, is measured in decibels (dB). 
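
The four metrics of Definitions 5.12 to 5.15 can be computed with a few lines of Python. The sketch below takes $x_{max}$ as the maximum of the reference samples, as stated above; returning infinity for identical signals is a convention chosen for this sketch. The sample values are hypothetical.

```python
import math


def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    return math.sqrt(mse(x, y))


def mae(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def psnr(x, y):
    error = mse(x, y)
    if error == 0:
        return math.inf          # identical signals: no noise at all
    return 10 * math.log10(max(x) ** 2 / error)


x = [10, 20, 30, 40]             # hypothetical reference samples
y_uniform = [11, 19, 31, 39]     # every sample is off by 1
y_outlier = [10, 20, 30, 44]     # a single sample is off by 4

print(rmse(x, y_uniform), mae(x, y_uniform))  # 1.0 1.0
print(rmse(x, y_outlier), mae(x, y_outlier))  # 2.0 1.0
print(round(psnr(x, y_uniform), 2))           # 32.04
```

Both approximations have the same MAE, but the RMSE of the second one is twice as large, which shows how the RMSE emphasizes outliers.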


The above values are easy to compute, but they are agnostic of the impression
which humans might have of certain errors [315]. It is known that certain deviations
between real and computed signal values are hardly noticed by humans. This is the
foundation of lossy coding techniques such as MP3, JPEG, or digital TV standards.
None of the metrics presented so far reflects the impression of deviations by humans.
Next, we will present the universal image quality index (UIQI) [562]. This index
tries to capture changes in the structure of images, since the human eye is very
sensitive to such changes. We will present the computation of this index for gray-scale images.
Several values need to be computed [315]:


$\mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i$  (5.21)
$\mu_y = \frac{1}{n} \sum_{i=1}^{n} y_i$  (5.22)
$l(x, y) = \frac{2 \mu_x \mu_y}{\mu_x^2 + \mu_y^2}$  (5.23)


Equations (5.21) and (5.22) compute the average brightness of each of the images,
and these averages are used to compute $l(x, y)$. For images of the same average
brightness, $l(x, y)$ will be equal to 1. Otherwise, this value will be less than 1.
Furthermore, we consider variances. Equations (5.24) and (5.25) compute the
contrast of each of the images, and these values are used to compute $c(x, y)$:


$\sigma_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)^2}$  (5.24)
$\sigma_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \mu_y)^2}$  (5.25)
$c(x, y) = \frac{2 \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$  (5.26)


For images of the same average contrast, $c(x, y)$ will be equal to 1. Otherwise, this
value will be less than 1. Equation (5.27) computes the cross-correlation of the two
images:


$\sigma_{x,y} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)$  (5.27)
$s(x, y) = \frac{\sigma_{x,y}}{\sigma_x \sigma_y}$  (5.28)


Positive values of $s(x, y)$ as computed from Eq. (5.28) correspond to a good
correlation of the two images; negative values correspond to an inverse correlation.
An overall quality index is then computed by Eq. (5.29):


$Q(x, y) = \frac{2 \mu_x \mu_y}{\mu_x^2 + \mu_y^2} \cdot \frac{2 \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} \cdot \frac{\sigma_{x,y}}{\sigma_x \sigma_y}$  (5.29)


Q = 1 for identical images, and Q will be negative for inversely correlated images.

It does not make sense to consider the correlation of images globally, since some 
inverse correlation in a particular block will already provide a negative impression 
about the image. Hence, Eq. (5.29) is computed only for blocks of pixels. The global 
UIQI value takes the values of Q for the different blocks into account. 
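
For a single block of pixels, the index Q can be computed as in the following Python sketch. It assumes that the block is given as a flat list of gray values and that the variances and the covariance use the same normalization by n - 1, so that Q = 1 results for identical blocks. Blocks with zero mean or zero variance make Q undefined and are not handled here. The function name and the pixel values are hypothetical.

```python
import math


def uiqi_block(x, y):
    """Quality index Q for two blocks of gray values of equal size."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    sigma_x, sigma_y = math.sqrt(var_x), math.sqrt(var_y)

    luminance = 2 * mu_x * mu_y / (mu_x ** 2 + mu_y ** 2)
    contrast = 2 * sigma_x * sigma_y / (var_x + var_y)
    structure = cov / (sigma_x * sigma_y)
    return luminance * contrast * structure


block = [1, 2, 3, 4]                       # hypothetical gray values
inverted = [4, 3, 2, 1]                    # same brightness and contrast, inverse structure

print(round(uiqi_block(block, block), 6))     # 1.0
print(round(uiqi_block(block, inverted), 6))  # -1.0
```

A global value for a complete image could then be obtained by combining the values of all blocks, for example by averaging them; the choice of block size and of the combination rule is left open in this sketch.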

The structural similarity index measure (SSIM) [563] is an extension of the UIQI 
objective. 

Kühn compared the different metrics and found that none of these is really 
superior to others [315]. He recommends that several of these metrics should be 
computed and a careful comparison should be performed in practice. An overview
of some useful objectives is also provided by Mittal [397].

In digital communications, the bit error ratio (BER) is an important metric.


Definition 5.16 The bit error ratio (BER) is the number of bit errors
divided by the total number of communicated bits.


5.3.3 Criteria for Data Analysis 


Sensors are typically not ideal in the sense that some readouts deviate from the real
values. Furthermore, it may be necessary to fuse data generated by various sensors.
Hence, it is necessary to use data analysis techniques, e.g., machine learning (see
p. 15). Generated results will not always be correct either, because sensor
readouts were already compromised or due to imperfect data analysis techniques. In
a way, we are dealing with approximate computing even though this term was not
used in this context.

For data analysis, classification of objects is a very frequent goal. Let X be a 
set of objects which we would like to classify. Suppose that we restrict ourselves to 
binary classification. 


Example 5.3 Consider the case of searching for amber at a beach.
Unfortunately, white phosphorus left over from bombs, found, e.g., at the Baltic
Sea, looks very much like amber but suddenly ignites when it dries and then burns
at 1300°C. Classifying some found objects as either amber or phosphorus is thus a
very delicate task (and hence, inexperienced people should not touch such objects
anyway). ∇


In this context, four cases are possible: 


• True positive (TP): we classify some object as amber, and it is actually valuable 
amber. 

• False positive (FP): we classify some object as amber, and it is actually 
dangerous. 

• True negative (TN): we classify some object as dangerous, and it is actually 
dangerous. 

• False negative (FN): we classify some object as dangerous, and it is actually 
valuable amber. 


Absolute numbers have to be related to each other. Hence, the following metrics 
have been defined: 


Definition 5.17 The precision p is defined as the fraction 

\[ p = \frac{TP}{TP + FP} \tag{5.30} \]


In the case of searching for amber, we aim at a precision of 1, since we do not 
want to get burnt. 


Definition 5.18 The recall r (or sensitivity) is defined as the fraction 

\[ r = \frac{TP}{TP + FN} \tag{5.31} \]

In order to obtain a good precision, we will have to accept some false negatives 
(e.g., amber classified as phosphorus). 


Definition 5.19 The accuracy acc is defined as the fraction 

\[ acc = \frac{TP + TN}{TP + FP + TN + FN} \tag{5.32} \]

In the case of searching for amber, we might tolerate a non-optimal accuracy, due 
to the importance of keeping false positives as close to zero as possible, and, hence, 
we might have several false negatives. 


Definition 5.20 The specificity is defined as the fraction 

\[ specificity = \frac{TN}{TN + FP} \tag{5.33} \]

Definition 5.21 The F1 score or F-measure is defined as the harmonic mean of 
precision and recall: 

\[ F1 = 2 * \frac{p * r}{p + r} \tag{5.34} \]
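
The metrics of Definitions 5.17 to 5.21 can be computed directly from the four counts. The following is a minimal sketch; the function name and the counts in the example are hypothetical and chosen to mimic a cautious amber collector who never picks up phosphorus but leaves some amber behind.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, accuracy, specificity, and F1 from raw counts.

    Assumes that none of the denominators is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts: no false positives, but four pieces of amber are missed.
metrics = classification_metrics(tp=6, fp=0, tn=10, fn=4)
print(metrics)
```

With these counts, the precision is 1 while recall and accuracy stay below 1, which is the trade-off described for the amber example.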


In a more general context, the quality of service (QoS) is another well-known 
metric. Frequently, it is related to the quality of communication channels, where bit 
error rates, latency, and bandwidth are indicators of quality. 
In an even wider sense, we may also consider not just those technical parameters 
but also the overall experience for the user. This is captured in the quality of 
experience (QoE) metric, which refers to the overall user experience including all 
aspects which might be considered by a user. There are a number of metrics which 
can be used to estimate the overall quality of experience [400]. 


5.4 Energy and Power Models 


5.4.1 General Properties 


Energy models and power models are essential for evaluating the corresponding 
objectives. Such models are needed for optimizations aiming at a reduction of power 
and energy consumption. They are also required for optimizations trying to reduce 
operating temperatures and to improve reliability. Power estimation is used in power 
management algorithms (see p. 373). 

Energy and power models are closely related, as can be seen from Eq. (3.13). 
Energy can be computed as the integral of power over time. Once the energy 
consumption is known, we can compute the average power consumption. In general, 
we can use: 


1. Measurements on real hardware: measurements can be very precise, but they 
apply only to the hardware at hand. Measuring voltages is typically rather easy 
and does not require complex procedures. 

Measuring currents can be done with a current clamp or a shunt resistor. 


• Current clamps have to enclose one of the wires of the power supply cable. 
They measure the magnetic field caused by the current flowing through the 
cable. The advantage of this approach is that no power wires have to be broken 
and power will remain connected unchanged to the device being analyzed. The 
disadvantage is that current clamps do not allow precise measurements. 

• Using an ammeter typically results in a better precision. However, an 
insertion of the ammeter directly into the power line has some disadvantages. 
For example, the system is unpowered if we remove the ammeter. Also, long 
cables might add noise. Therefore, it is typically preferable if we include a 
shunt resistor. A typical circuit containing a shunt is shown in Fig. 5.16 (left). 

The advantage of using a shunt resistor over using a simple ammeter is that 
the shunt can be integrated into the power wires. Due to the shunt resistor, 
currents flowing into the device under test will cause a voltage drop across 
the shunt, and this voltage can be measured and used to compute the current 
from Ohm’s law. Finding the right resistance of the shunt is an issue. If the 
resistance is too large, the device under test will be powered with a voltage 
lower than the original voltage and might even fail to work. If the resistance 
is too small, the voltage across the shunt will be too small to be reliably 
measured and will be subject to a substantial amount of noise. Selecting the 
right resistance depends on the current flowing into the device under test. If 
this current varies substantially, it may even be necessary to employ several 
shunt resistors and switch between them, depending on the current actually 
flowing. The problem regarding the voltage drop can be partially avoided 
when regulated power supplies are used and the regulator feedback input 
can be connected to the voltage actually powering the device (see Fig. 5.16 
(right)). The power supply would then try to keep the voltage at the device 
at its nominal level. However, the voltage across the shunt is affected by the 
current flowing back into the voltage regulator input. 

Fig. 5.16 Measuring current: left, two-wire connection; right, feedback into voltage regulator 

Unfortunately, there will not be a separate power pin or wire for every 
component within the device and we can compute only a lumped sum of 
currents drawn by the device. We may have to stimulate the device in a 
particular way in order to get information about the consumption of the 
different components. 

2. Models: models can be used even when real hardware is not available, but they 
can be very imprecise. Models have to be validated; otherwise they would 
remain very questionable. Two validation methods can be found for many of 
the available power and energy models: either models are validated against 
more detailed models at a lower level of abstraction, or they are compared 
with measurements on real devices, resulting in a hybrid model. Validation 
against measurements requires a method for selecting model parameters. 
Frequently, linear models are selected, and their parameters are determined using 
the least squares method (minimizing the MSE as per Eq. (5.16)). Curve fitting 
with this method is typically available in mathematical toolboxes such as 
MATLAB®. More recently, machine learning has become a popular alternative for 
this purpose. For example, Falkenberg et al. [161] used machine learning 
for modeling the power consumption of transmitters in mobile phones. 
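
To illustrate the least squares fitting of a linear model mentioned above, the following sketch fits a model of the form P = P_idle + k * utilization to a handful of measurements. The function name, the choice of CPU utilization as the single model input, and all sample values are assumptions made for this illustration; real models typically have more parameters.

```python
def fit_linear_power_model(x, y):
    """Least squares fit of y = intercept + slope * x for one input variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form solution of the normal equations for a straight line.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical measurements: CPU utilization (0..1) and measured power in watts.
utilization = [0.0, 0.25, 0.5, 0.75, 1.0]
power_w = [1.0, 1.5, 2.0, 2.5, 3.0]

p_idle, p_per_util = fit_linear_power_model(utilization, power_w)
print(p_idle, p_per_util)
```

The fitted parameters minimize the mean squared error between the model and the measurements; with noisy measurements, the residual error gives a first indication of how well a linear model fits.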


There is no one-approach-fits-all solution for energy consumption modeling. 
Instead, the usual approach is to combine ideas for modeling to fit the needs at hand. 
Therefore, we will present representative examples of power models and hope that 
the reader will identify the combination of methods which fits his/her constraints 
best. 
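
As an illustration of the shunt-based measurement described above, the following sketch converts sampled shunt voltages into currents via Ohm's law, multiplies them by the device voltage to obtain power, and integrates power over time to obtain energy and average power. The function name, the sampling interval, and all sample values are assumptions made for this illustration.

```python
def energy_from_shunt(v_shunt, v_device, r_shunt, dt):
    """Return (energy in J, average power in W) from equidistant voltage samples.

    v_shunt:  voltages across the shunt resistor in volts
    v_device: voltages at the device under test in volts
    r_shunt:  shunt resistance in ohms
    dt:       sampling interval in seconds
    """
    currents = [v / r_shunt for v in v_shunt]            # Ohm's law
    powers = [i * v for i, v in zip(currents, v_device)]  # P = I * V
    # Energy is the integral of power over time (trapezoidal rule).
    energy = sum((powers[k] + powers[k + 1]) / 2 * dt
                 for k in range(len(powers) - 1))
    duration = dt * (len(powers) - 1)
    return energy, energy / duration


# Hypothetical samples taken every millisecond with a 0.1 ohm shunt.
v_shunt = [0.010, 0.020, 0.020, 0.010]
v_device = [5.0, 5.0, 5.0, 5.0]
energy_j, avg_power_w = energy_from_shunt(v_shunt, v_device, r_shunt=0.1, dt=0.001)
print(energy_j, avg_power_w)
```

Such a computation yields only the lumped consumption of the device; attributing it to individual components still requires dedicated stimuli.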
5.4.2 Energy Model for Memories 


As described in the section on memory hardware (see p. 168), the energy 
consumption of caches and other memories can be computed with CACTI [408, 589]. 
CACTI assumes an abstract layout of the memory, extracts capacitances from the 
layout, and computes access times, cycle times, area, leakage, and dynamic power 
consumption from this information. CACTI has been validated against models of 
the same memories at a more detailed level, employing SPICE [519] as the solver 
at that level. Currently (in 2020), the most recent version of CACTI (version 6.5) 
is available from http://www.hpl.hp.com/research/cacti/.⁵ Recent enhancements 
include detailed modeling of the interconnect and modeling of non-uniform memory 
accesses. Models of transmitters and sense amplifiers have been included. Also, 
the architectural and technological parameters used can be specified. 


5.4.3 Energy Model for Instructions 


One of the first power models was proposed by Tiwari [542]. The model includes 
so-called base costs and inter-instruction costs. Base costs of an instruction correspond 
to the energy consumed per instruction execution if an infinite sequence of instances 
of that instruction is executed. Base costs have been computed by running programs 
consisting of 120 identical instructions and a branch back to the beginning of this 
sequence. Programs are designed such that no stall cycles appear. This may require 
adding no-operation instructions and some simple calculations to eliminate 
their contribution to the energy consumption. 

Inter-instruction costs model the additional energy consumed by the processor 
if instructions change. This additional energy is required, for example, due to 
switching functional units on and off. Inter-instruction costs reflect the impact of 
the initial circuit state on the overall energy consumption of an instruction. These 
costs can be computed by running programs containing an alternating sequence of 
instruction pairs. 

Base costs and inter-instruction costs are computed for a program not generating 
any cache misses. The effect of cache misses has to be added to these two costs. 
This requires the knowledge of the cache miss ratio and the memory access 
energy. The memory energy depends on the addresses accessed. No attempt is 
made to statically predict memory addresses. Hence, this contribution can only be 
determined dynamically, during the execution of the program. 
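
The structure of such an instruction-level model can be sketched as follows: the energy of a trace is the sum of base costs, inter-instruction costs for each change of instruction, and a contribution for cache misses observed at run time. All cost values and names below are hypothetical and are not taken from [542].

```python
# Hypothetical base costs per instruction execution, in nanojoules.
BASE_COST_NJ = {"add": 1.0, "mul": 2.5, "lw": 1.8}

# Hypothetical inter-instruction costs for unordered instruction pairs, in nanojoules.
INTER_COST_NJ = {
    frozenset(("add", "mul")): 0.3,
    frozenset(("add", "lw")): 0.2,
    frozenset(("mul", "lw")): 0.4,
}


def program_energy_nj(trace, cache_misses=0, miss_energy_nj=0.0):
    """Estimate the energy of an executed instruction trace.

    cache_misses and miss_energy_nj have to come from observing the execution,
    since memory addresses are not predicted statically.
    """
    base = sum(BASE_COST_NJ[op] for op in trace)
    # Inter-instruction costs apply only where consecutive instructions differ.
    inter = sum(INTER_COST_NJ.get(frozenset((a, b)), 0.0)
                for a, b in zip(trace, trace[1:]) if a != b)
    return base + inter + cache_misses * miss_energy_nj


trace = ["lw", "add", "mul", "add"]
energy_nj = program_energy_nj(trace, cache_misses=1, miss_energy_nj=10.0)
print(energy_nj)
```

Treating inter-instruction costs as symmetric in the two instructions is a simplification of this sketch.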

The model has been applied to two real systems, an Intel 486 DX2 and a Fujitsu 
SPARClite 934. Measurements of the currents have been used to calibrate the model. 


⁵ It is recommended to use this URL, since there are several tools with the same name. Currently, a 
modifiable C++ version is available. Previously available web interfaces do not exist any longer. 
5.4.4 Energy Model for Functional Processor Units 


The Wattch power estimation tool [70] estimates the power consumption of micro- 
processor systems at the architectural level. Wattch uses the SimpleScalar simulator 
to simulate processors. SimpleScalar can be configured to model the processor 
at hand as closely as possible. The number of pipeline stages and functional 
units is typically correctly modeled, whereas some more specialized features are 
possibly not. Wattch is based on detailed information on the energy consumption 
of the different components which we could find in a microprocessor. While 
running, SimpleScalar keeps track of invoked functional units. Wattch exploits this 
information in order to compute an overall energy consumption. 

Wattch requires much more information about the architecture than Tiwari’s 
instruction-set level approach. For example, Wattch includes its own detailed model 
of the energy consumption in memories. Also, clocking is taken explicitly into 
account, including conditional clocking if clock gating is used. In the original paper 
[70], results have been validated for three different processors. 


5.4.5 Energy Model for Processor and Memory 


The level of detail of the model by Steinke et al. [510] lies between that of Tiwari 
and that of Wattch. For instructions and for data, the model considers the sum of the 
energies consumed in the CPU and the memory: 


\[ E_{total} = E_{cpu\_instr} + E_{cpu\_data} + E_{mem\_instr} + E_{mem\_data} \tag{5.35} \]


Each of the four terms is then computed from detailed equations. The following 
notation is used in these equations: $m$ is the number of instructions considered, $w(b)$ 
returns the number of ones in its argument (either code or data), $h(b_1, b_2)$ returns 
the Hamming distance between its two arguments, $dir$ denotes the direction of data 
transfer, and $\alpha_i$ and $\beta_i$ ($i \in \{1..10\}$) are constants computed from curve fitting of 
measured energies. Using this notation, $E_{cpu\_data}$ can be computed as follows: 

\[
\begin{aligned}
E_{cpu\_data} = \sum_{i=1}^{m} \{ & \alpha_5 * w(DAddr_i) + \beta_5 * h(DAddr_{i-1}, DAddr_i) \\
& + \alpha_{6,dir} * w(Data_i) + \beta_{6,dir} * h(Data_{i-1}, Data_i) \}
\end{aligned}
\tag{5.36}
\]

where $Data_i$ is the data value used in instruction $i$, and $DAddr_i$ is its address. 


Furthermore, consider $E_{mem\_data}$, a term which is relevant only when the data is 
actually loaded from the main memory: 

\[
\begin{aligned}
E_{mem\_data} = \sum_{i=1}^{m} \{ & BaseMem(DataMem, dir, Word\_width) \\
& + \alpha_9 * w(DAddr_i) + \beta_9 * h(DAddr_{i-1}, DAddr_i) \\
& + \alpha_{10,dir} * w(Data_i) + \beta_{10,dir} * h(Data_{i-1}, Data_i) \}
\end{aligned}
\tag{5.37}
\]

where $BaseMem$ is the base cost for accessing a memory object of a particular 
width in direction $dir$. 
$E_{mem\_instr}$ can be computed as follows: 

\[
\begin{aligned}
E_{mem\_instr} = \sum_{i=1}^{m} \{ & BaseMem(InstrMem, Word\_width_i) \\
& + \alpha_7 * w(IAddr_i) + \beta_7 * h(IAddr_{i-1}, IAddr_i) \\
& + \alpha_8 * w(IData_i) + \beta_8 * h(IData_{i-1}, IData_i) \}
\end{aligned}
\tag{5.38}
\]

where $BaseMem$ is the base cost for accessing a memory word of a particular width 
from the instruction memory, $IAddr_i$ is the address of the instruction, and $IData_i$ 
is instruction $i$ itself. 

$E_{cpu\_instr}$ can be computed from the following equation: 

\[
\begin{aligned}
E_{cpu\_instr} = \sum_{i=1}^{m} \{ & BaseCPU(Opcode_i) + FUChange(Instr_{i-1}, Instr_i) \\
& + \alpha_4 * w(IAddr_i) + \beta_4 * h(IAddr_{i-1}, IAddr_i) \\
& + \sum_{j=1}^{s} ( \alpha_1 * w(Imm_{i,j}) + \beta_1 * h(Imm_{i-1,j}, Imm_{i,j}) ) \\
& + \sum_{k=1}^{t} ( \alpha_2 * w(Reg_{i,k}) + \beta_2 * h(Reg_{i-1,k}, Reg_{i,k}) ) \\
& + \sum_{k=1}^{t} ( \alpha_3 * w(RegVal_{i,k}) + \beta_3 * h(RegVal_{i-1,k}, RegVal_{i,k}) ) \}
\end{aligned}
\tag{5.39}
\]

where $BaseCPU$ is the base cost for $Opcode_i$, $FUChange(..)$ reflects the costs 
caused by the transition from instruction $i - 1$ to $i$, $Imm$ reflects the impact of up 
to $s$ immediate values per instruction, $Reg$ reflects the register numbers of up to $t$ 
registers per instruction, and $RegVal$ reflects up to $t$ register values per instruction. 

To determine constants, dedicated code sequences have to be designed in order 
to attribute energy consumption to particular terms of the equations. 
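
To make the role of the number of ones and of the Hamming distance concrete, the following sketch evaluates the data-related CPU term of the model for a short sequence of data accesses. The function names, the assumption of an all-zero state before the first access, and all constants are hypothetical; real constants result from curve fitting of measured energies.

```python
def w(b: int) -> int:
    """Number of ones in the binary representation of b."""
    return bin(b).count("1")


def h(b1: int, b2: int) -> int:
    """Hamming distance between b1 and b2."""
    return w(b1 ^ b2)


def e_cpu_data(accesses, alpha5, beta5, alpha6, beta6):
    """Sum the address and data contributions over all accesses.

    accesses: list of (address, data, direction) tuples
    alpha6, beta6: dicts mapping a direction to its constant
    """
    energy = 0.0
    prev_addr, prev_data = 0, 0  # assumption: all-zero state before the first access
    for addr, data, direction in accesses:
        energy += alpha5 * w(addr) + beta5 * h(prev_addr, addr)
        energy += alpha6[direction] * w(data) + beta6[direction] * h(prev_data, data)
        prev_addr, prev_data = addr, data
    return energy


# Hypothetical constants and accesses.
accesses = [(0b0011, 0b1111, "read"), (0b0101, 0b0000, "write")]
energy = e_cpu_data(accesses, alpha5=1.0, beta5=2.0,
                    alpha6={"read": 3.0, "write": 4.0},
                    beta6={"read": 5.0, "write": 6.0})
print(energy)
```

The other three terms of the model follow the same pattern, with additional base costs and further sums over immediates and registers.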
Example 5.4 The following code sequence allows measuring the energy required 
for executing a load word instruction: 


start: lw R1, address /* load word */ 
       lw R1, address /* load word */ 
       ...            /* lw instruction repeated 50-100 times */ 
       bra start      /* back to the start */ 


The impact of the branch back to the beginning on the energy consumption can be 
neglected. The impact of different addresses, register numbers, and register content 
can be studied by varying these values. For example, we can initially set all these 
values to zero and then incrementally study the impact of additional ones. ∇ 


In our own experiments, constants were determined by running a linear regression 
method on the data. A significant impact of the number of ones in the data was 
found, which would have gone unnoticed with Tiwari’s model. 


5.4.6 Energy Model for an Application 


The Odroid-XU3 [202] platform (see Fig. 5.17) comprises current sensors. The 
sensors enable precise measurement of the consumed power during the execution 
of applications, measuring the consumption of ARM® big cores, little cores, GPU, 
and DRAM individually. This possibility is exploited by several researchers. For 
example, Neugebauer et al. [416] have integrated Odroid-XU3 processors into their 
design space exploration for one application. Hence, design space exploration is 
based on a realistic analysis of the consumed energy. This approach eliminates 
the use of models of unknown precision. The overall approach for design space 
exploration enabled by the XU3 is shown in Fig. 5.18. 


Fig. 5.17 Odroid-XU3 
Fig. 5.18 Evolutionary algorithm, fitness estimation based on real measurements 


The design space exploration is based on a genetic algorithm. The evaluation of 
a particular solution is based on real execution of the code on an XU3. The resulting 
optimized algorithm has been used by Neugebauer et al. [417] within the 
cyber-physical system PAMONO, which is capable of detecting bio-viruses online. It is 
based on the so-called plasmon effect, a physical effect used to visualize small objects. 
Unfortunately, the Odroid-XU3 has been discontinued and replaced by the XU4, which 
does not include current sensors. 
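
The idea of using measurements as the fitness function of an evolutionary search can be sketched as follows. This is a simple mutation-only evolutionary loop with elitist selection, not the algorithm of Neugebauer et al.; the configuration encoding (three integer parameters) and the function measure_energy, which stands in for an execution on the board, are purely hypothetical.

```python
import random


def measure_energy(config):
    """Stand-in for running the application on the board and reading its sensors.

    The synthetic cost below is hypothetical; on real hardware, this function
    would execute the code and return the measured energy.
    """
    big, little, freq = config
    return (big - 1) ** 2 + (little - 3) ** 2 + (freq - 2) ** 2 + 5.0


def mutate(config, rng):
    """Change one randomly chosen parameter by +/-1 within the bounds [0, 4]."""
    genes = list(config)
    k = rng.randrange(len(genes))
    genes[k] = min(4, max(0, genes[k] + rng.choice((-1, 1))))
    return tuple(genes)


def explore(generations=30, population_size=8, seed=0):
    """Return the best configuration found by the evolutionary loop."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 4) for _ in range(3))
                  for _ in range(population_size)]
    for _ in range(generations):
        offspring = [mutate(c, rng) for c in population]
        # Fitness is the measured energy; the best individuals survive.
        population = sorted(population + offspring, key=measure_energy)[:population_size]
    return population[0]


best = explore()
print(best, measure_energy(best))
```

Since every fitness evaluation corresponds to a real execution, the number of generations and the population size directly determine the measurement time required.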


5.4.7 Energy Model for Multiple Applications with Hardware 
Multithreading 


Kerrison and Eder analyzed the energy consumption of the XMOS XS1-L multi- 
threaded processor design for real-time applications [290]. One of the particular 
features of that processor is its hardware-supported multithreading: it performs fast 
context switching between four threads in hardware. One of the research questions 
was: how much does the hardware context switching between threads cost? Due 
to the availability of real hardware, this question could be answered with real 
measurements. The power consumed by the XMOS XS1-L was measured with 
a shunt resistor inserted into its power cable, and the resistor was connected to 
an INA219 power measurement chip (see http://www.ti.com/product/ina219). The 
software running on the processor was controlled from a second processor. It turned 
out that the best energy efficiency was reached when all four hardware threads 
were used. However, hardware multithreading leads to many charging/discharging 
operations and a corresponding energy consumption. The interesting experimental 
results include an analysis of the impact of executed instructions on the energy 
consumption, as shown in Fig. 5.19 for the case of 8 bit data. 

Figure 5.20 displays the corresponding information for the case of 16 bit data. 

The two dimensions of the diagrams encode the applications which are run in 
the odd and even threads, respectively. In these figures, a change in the number of 
operands is indicated by dashed lines. Instructions with three or more operands are 
shown at the top and at the right end of each diagram. Obviously, the consumed 
energy increases with the number of operands. Figure 5.20 demonstrates that 
processing 16 bit data requires more energy than processing 8 bit data. Kerrison 
et al. use these results in order to optimize embedded software. 

Fig. 5.19 Power analysis for multithreading for 8 bit data, top, power consumption as a function 
of instructions on 8 bit data executed in the even threads (vertical axis) and in the odd threads 
(horizontal axis) ©Kerrison, Eder; bottom, color encoding of temperatures 

Fig. 5.20 Power analysis for multithreading for 16 bit data, power consumption as a function 
of instructions on 16 bit data executed in the even threads (vertical axis) and in the odd threads 
(horizontal axis); temperature encoding as in Fig. 5.19 (bottom) ©Kerrison, Eder 


5.4.8 Energy Model for an Android Mobile Phone 


Zhang et al. [612] describe a power model construction technique for an HTC 
Android phone, called PowerBooter. Their technique uses the following equation: 


\[
\begin{aligned}
E = {} & (\beta_{uh} * freq_h + \beta_{ul} * freq_l) * util + \beta_{CPU} * CPU_{on} \\
& + \beta_{br} * brightness + \beta_{Gon} * GPS_{on} + \beta_{Gsl} * GPS_{sl} \\
& + \beta_{WiFi\_l} * WiFi_l + \beta_{WiFi\_h} * WiFi_h + \beta_{3G\_idle} * 3G_{idle} \\
& + \beta_{3G\_FACH} * 3G_{FACH} + \beta_{3G\_DCH} * 3G_{DCH}
\end{aligned}
\tag{5.40}
\]

where 

$\beta_{..}$ : constants to be determined 
$freq_{..}$ : CPU frequencies 
$util$ : CPU utilization 
$CPU_{on}$ : refers to processor utilization 
$brightness$ : takes illumination into account 
$GPS_{..}$ : relates to GPS usage 
$WiFi_l$ : amount of time, Wi-Fi is in low-speed mode 
$WiFi_h$ : amount of time, Wi-Fi is in high-speed mode 
$3G_{idle}$ : amount of time, 3G is idle 
$3G_{FACH}$ : amount of time, a shared 3G channel is used 
$3G_{DCH}$ : amount of time, a dedicated 3G channel is used 


Obviously, PowerBooter abstracts much more from the details of the hardware 
implementation. Note that PowerBooter also includes communication, which was not 
taken into account in our previous models. Parameters are determined, as before, 
by measuring currents in dedicated setups and using some curve fitting method. 
Measurements are based on a Monsoon power monitor (see 
http://www.msoon.com/LabEquipment/PowerMonitor/). 
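
Once the constants have been determined, evaluating a linear model of this kind is straightforward. The following sketch mirrors the structure of Eq. (5.40); the dictionary keys, the coefficient values, and the example state are all hypothetical and do not reproduce the values published in [612].

```python
def powerbooter_estimate(beta: dict, state: dict) -> float:
    """Evaluate a linear model with the structure of Eq. (5.40)."""
    cpu = ((beta["uh"] * state["freq_h"] + beta["ul"] * state["freq_l"]) * state["util"]
           + beta["cpu"] * state["cpu_on"])
    # The remaining terms are simple products of a constant and a state variable.
    linear_terms = ("brightness", "gps_on", "gps_sl", "wifi_l", "wifi_h",
                    "3g_idle", "3g_fach", "3g_dch")
    return cpu + sum(beta[name] * state[name] for name in linear_terms)


# Hypothetical constants (as they might result from curve fitting).
beta = {"uh": 4.0, "ul": 3.0, "cpu": 100.0, "brightness": 2.0,
        "gps_on": 400.0, "gps_sl": 150.0, "wifi_l": 20.0, "wifi_h": 700.0,
        "3g_idle": 10.0, "3g_fach": 400.0, "3g_dch": 600.0}

# Hypothetical system state: CPU at high frequency, 50 % utilization,
# GPS sleeping, Wi-Fi in low-speed mode, 3G idle.
state = {"freq_h": 1, "freq_l": 0, "util": 50, "cpu_on": 1, "brightness": 100,
         "gps_on": 0, "gps_sl": 1, "wifi_l": 1, "wifi_h": 0,
         "3g_idle": 1, "3g_fach": 0, "3g_dch": 0}

estimate = powerbooter_estimate(beta, state)
print(estimate)
```

The unit of the result depends on the units chosen for the constants during curve fitting.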

The model construction technique allows, in combination with a battery model, 
a prediction of battery lifetime. The resulting information is made available to a 
tool called PowerTutor. PowerTutor is intended to provide some help for adjusting 
applications to different hardware platforms and as an aid for application developers 
to exploit power-saving techniques in their application without digging deep into the 
peculiarities of the available hardware. 

Another model for the energy consumption in mobile phones was presented by 
Dusza et al. [144]. Several commercial tools also provide power and/or energy 
estimation. 

All of the energy consumption models considered so far were designed to 
model an average case power or energy consumption, where the term “average case” 
might still need some clarification. Computed models might apply only for certain 
inputs or for certain initial states. Average case results are valuable for predicting 
temperatures and battery lifetime for certain time intervals. 


5.4.9 Worst Case Energy Consumption 


In certain contexts, the worst case power consumption or worst case energy 
consumption is of interest. 
Definition 5.22 The worst case energy consumption (WCEC) of an embedded 
system is defined as the largest energy consumption, computed as the maximum of 
the energy consumption for all inputs and initial states. 


Definition 5.23 The worst case power consumption (WCPC) of an embedded 
system is defined as the largest power consumption, computed as the maximum of 
the power consumption for all inputs and initial states. 


The WCPC is relevant in the context of the dimensioning of the interconnect 
and the power supply. The WCEC is relevant in the context of the design of battery 
systems. We need to guarantee that the chosen battery system meets the WCEC 
requirements. A safe upper bound on the WCEC can be computed as follows: 


\[ WCEC \le \int_{0}^{WCET} WCPC \, dt = WCET * WCPC \]


Techniques for tighter WCEC estimation have been proposed, for example, by 
Jayaseelan et al. [271], by Pallister et al. [443], and by Wägemann et al. [559]. 
Similar to the computation of worst case execution times, these tighter bounds may 
still be an overestimation, and the actual worst case power and energy consumption 
are still unknown. 
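
The safe upper bound WCET * WCPC can be used for a first check of a battery design. The following is a minimal sketch; the function names and all numbers are hypothetical.

```python
def wcec_upper_bound(wcet_s: float, wcpc_w: float) -> float:
    """Safe upper bound on the worst case energy consumption in joules."""
    return wcet_s * wcpc_w


def battery_suffices(capacity_j: float, activations: int,
                     wcet_s: float, wcpc_w: float) -> bool:
    """Check whether the battery covers the given number of task activations."""
    return activations * wcec_upper_bound(wcet_s, wcpc_w) <= capacity_j


# Hypothetical task: WCET of 20 ms, WCPC of 1.5 W.
bound_j = wcec_upper_bound(wcet_s=0.02, wcpc_w=1.5)
ok = battery_suffices(capacity_j=10_000.0, activations=100_000,
                      wcet_s=0.02, wcpc_w=1.5)
print(bound_j, ok)
```

Since the bound assumes that the worst case power is drawn during the entire worst case execution time, it may be very pessimistic.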


5.5 Thermal Models 


The quest for higher performances of embedded systems increased the chances of 
components becoming hot. Temperatures of the various components of embedded 
systems can have a serious impact on their usability, e.g., on sensor readouts. In the 
worst case, overheated components cause damage to other systems. For example, 
they may cause fire hazards. Hot components might also have other consequences, 
even in the absence of immediate failures. For example, the system life might be 
shortened, sometimes by large factors (see Black’s equation on p. 283). Also, it may 
be necessary to power down parts of silicon chips in order to avoid overheating. This 
has been called the dark silicon problem [153]. 

The thermal behavior of embedded systems is closely linked to the transforma- 
tion of electrical energy into heat. Therefore, thermal models are usually linked to 
energy models. Thermal models are based on the laws of physics.⁶ 


⁶ We will denote temperatures by θ in order to avoid confusion with periods, which are denoted by T. 
Fig. 5.21 Plate of thickness L 


Table 5.1 Approximate thermal characteristics of materials for air, copper, and silicon 

Material       | κ: thermal conductivity | c_p: specific heat | c_v: volumetric heat capacity 
               | (W/(K m))               | (J/(K g))          | (J/(K m³)) 
Air (25 °C)    | 0.025 [583]             | 1.012 [578]        | 1.21 * 10³ [578] 
Copper         | 401 [583]               | 0.385 [568, 578]   | 3.45 * 10⁶ [578] 
Silicon (726C) | 148 [148]               | 0.705 [148, 568]   | 1.64 * 10⁶ [148] (a) 

(a) Calculated using Eq. (5.56) 


5.5.1 Steady-State Behavior 


Consider a homogeneous plate made of a particular material and of area $A$ and 
thickness $L$ (see Fig. 5.21). Suppose that there is a temperature difference of $\Delta\theta$ 
between the opposite sides. We assume that heat will be propagating independently 
of the direction (isotropy), and we assume being in the steady state (no transients). 
Furthermore, the sides of the area are supposed to be much larger than the thickness of 
the plate, and we can ignore effects at the boundary of the plate. Then, the thermal 
power which gets transferred across the plate is equal to 

\[ P_{th} = \kappa * \frac{\Delta\theta * A}{L} \tag{5.41} \]

where $P_{th}$: thermal power transferred; $\kappa$: thermal conductivity; $A$: area; 
$\Delta\theta$: temperature difference; $L$: thickness. 

Equation (5.41) is also known as Fourier’s law. 


Definition 5.24 Due to Eq. (5.41), we can define thermal conductivity $\kappa$ as the 
amount of the thermal power $P_{th}$ transferred through a plate made of some material 
of unit area and unit thickness when the temperatures at the opposite sides differ by 
one temperature unit (typically Kelvin). 


Frequently, $\lambda$ is used instead of $\kappa$. $\kappa$ depends on the material and environmental 
conditions. Values for some common materials for common conditions are included 
in Table 5.1. Refer to the cited sources for more information on the dependency on 
environmental conditions. 


Definition 5.25 Thermal conductance [169] is defined as the amount of thermal 
energy which passes through a plate per unit of time if the temperatures at the two 
ends differ by one unit of temperature (typically Kelvin). 


From Eq. (5.41), we have 

\[ \frac{P_{th}}{\Delta\theta} = \kappa * \frac{A}{L} \tag{5.42} \]

The reciprocal of this value is called thermal resistance $R_{th}$: 

\[ R_{th} = \frac{\Delta\theta}{P_{th}} = \frac{L}{\kappa * A} \tag{5.43} \]

Fig. 5.22 Thermal model of microprocessor with fan 


Lemma 5.1 Thermal resistances add up like electrical resistances. This allows us 
to map thermal modeling to electrical modeling. 


Example 5.5 Figure 5.22 shows a microprocessor generating a thermal power $P_{th}$, 
together with the thermal resistance $R_{th,die}$ of the die (chip) and the thermal 
resistance $R_{th,fan}$ of the fan. 

Adding resistances results in the following equations: 

\[ R_{th} = R_{th,die} + R_{th,fan} \tag{5.44} \]
\[ \Delta\theta = R_{th} * P_{th} \tag{5.45} \]

Let us assume the following: 

\[ R_{th,die} = 0.4\ \mathrm{K/W} \tag{5.46} \]
\[ R_{th,fan} = 0.3\ \mathrm{K/W} \tag{5.47} \]
\[ P_{th} = 10\ \mathrm{W} \tag{5.48} \]

Then, we compute: 

\[ \Delta\theta = 7\ \mathrm{K} \tag{5.49} \]
\[ \Delta\theta_{fan} = 3\ \mathrm{K} \tag{5.50} \]

Consumed power and thermal resistances are related to the estimation of the thermal 
design power. ∇ 
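
The computation of Example 5.5 can be reproduced with a few lines of code. The sketch below treats the thermal resistances like electrical resistances in series; the function and variable names are chosen for this illustration only.

```python
def temperature_drops(p_th_w: float, resistances_k_per_w: dict):
    """Return the total temperature difference and the drop across each resistance.

    Thermal resistances in series add up; each drop follows from
    delta_theta = R_th * P_th.
    """
    total = sum(resistances_k_per_w.values()) * p_th_w
    drops = {name: r * p_th_w for name, r in resistances_k_per_w.items()}
    return total, drops


# Values from Example 5.5: thermal resistances in K/W, thermal power in W.
total_k, drops_k = temperature_drops(10.0, {"die": 0.4, "fan": 0.3})
print(total_k, drops_k)
```

The result corresponds to a total difference of 7 K, of which 3 K occur across the fan.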


Definition 5.26 ([584]) “The thermal design power (TDP), sometimes called ther- 
mal design point, is the maximum amount of heat generated by a computer chip 
or component (often the CPU or GPU) that the cooling system in a computer is 
designed to dissipate in typical operation. Rather than specifying CPU’s real power 
dissipation, TDP serves as the nominal value for designing CPU cooling systems.” 


We could try to derive the TDP from the WCPC. In practice, however, published 
TDP values are typically smaller. Hence, temperature sensors are required in order 
to obtain a safe operation. 


5.5.2 Transient State Behavior 


So far, we have just considered the steady state. In general, transients and thermal 
capacitance (heat capacity) have to be considered. 


Definition 5.27 The thermal capacitance (heat capacity) of some object is 
defined as the amount of thermal energy $E_{th}$ which can be stored per difference 
$\Delta\theta$ in temperatures: 

\[ C_{th} = \frac{E_{th}}{\Delta\theta} \tag{5.51} \]


Primarily, $C_{th}$ depends on the amount and type of matter contained in the object: 

\[ C_{th} = c_p * m \tag{5.52} \]

where $c_p$ is the specific heat and $m$ the mass. We can also interpret Eq. (5.52) as the 
definition of the specific heat: 


Definition 5.28 The specific heat $c_p$ of some object made of some material of mass 
$m$ is defined as 

\[ c_p = \frac{C_{th}}{m} \tag{5.53} \]


$c_p$ depends on the type of matter used. $c_p$ is temperature-dependent, but can be 
considered constant for small temperature ranges. 

In our context, it is frequently more convenient to consider the heat capacity per 
volume instead of per unit of mass. 


Definition 5.29 The volumetric heat capacity $c_v$ is defined as 

\[ c_v = \frac{C_{th}}{\mathcal{V}} \tag{5.54} \]

where $\mathcal{V}$ is the volume of the object. 
$c_v$ and $c_p$ are related by the mass density: 


Definition 5.30 The mass density or volume density $\rho$ is defined as 

\[ \rho = \frac{m}{\mathcal{V}} \tag{5.55} \]

Inserting $\mathcal{V} = m/\rho$ into the definition of $c_v$, we have 

\[ c_v = \frac{C_{th}}{\mathcal{V}} = \frac{C_{th} * \rho}{m} = c_p * \rho \tag{5.56} \]


This allows us to convert between tables published for cp and c, (see, e.g., 
Table 5.1). Due to the correspondence to electrical circuits, we can also compute 
the transient behavior. 
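
As a small worked example of this conversion, the following sketch derives the volumetric heat capacity of silicon from its specific heat. The specific heat is the value listed in Table 5.1; the density of about 2.33 g/cm³ is an assumption added for this illustration, and the names are hypothetical.

```python
def volumetric_heat_capacity(c_p_j_per_k_g: float, density_g_per_m3: float) -> float:
    """Convert specific heat (J/(K g)) into volumetric heat capacity (J/(K m^3))."""
    return c_p_j_per_k_g * density_g_per_m3


# Assumption: silicon has a density of about 2.33 g/cm^3 = 2.33e6 g/m^3.
SILICON_DENSITY_G_PER_M3 = 2.33e6
SILICON_C_P_J_PER_K_G = 0.705  # specific heat of silicon from Table 5.1

c_v_silicon = volumetric_heat_capacity(SILICON_C_P_J_PER_K_G, SILICON_DENSITY_G_PER_M3)
print(c_v_silicon)
```

The result is approximately 1.64 * 10⁶ J/(K m³), in line with the calculated table entry for silicon.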


Example 5.6 We extend our microprocessor example as shown in Fig. 5.23 (left). 
The resulting transient for the temperature across the die and the fan is shown in 
Fig. 5.23 (right). The system approaches the stable state like a network of resistors 
and capacitors. ∇ 
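
The behavior of such a transient can be illustrated with the simplest possible case, a single thermal resistance and a single thermal capacitance. This is a sketch under stated assumptions, not the exact model of Fig. 5.23: the lumped resistance of 0.7 K/W, the capacitance of 10 J/K, and all names are hypothetical.

```python
import math


def step_response(r_th, c_th, p_th, t):
    """Closed-form temperature rise of one lumped RC stage, t seconds after power-on."""
    return r_th * p_th * (1.0 - math.exp(-t / (r_th * c_th)))


def simulate_euler(r_th, c_th, p_th, t_end, dt):
    """Integrate C_th * d(delta_theta)/dt = P_th - delta_theta / R_th with explicit Euler."""
    delta_theta = 0.0
    for _ in range(round(t_end / dt)):
        delta_theta += dt * (p_th - delta_theta / r_th) / c_th
    return delta_theta


# Hypothetical values: R_th = 0.7 K/W, C_th = 10 J/K, P_th = 10 W.
R_TH, C_TH, P_TH = 0.7, 10.0, 10.0
exact = step_response(R_TH, C_TH, P_TH, t=35.0)
numeric = simulate_euler(R_TH, C_TH, P_TH, t_end=35.0, dt=0.01)
print(exact, numeric)
```

After five time constants (R_th * C_th = 7 s), the temperature difference is close to its steady-state value R_th * P_th = 7 K, just like the voltage across a charging capacitor approaches its final value.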


Overall, it is feasible to model thermal behavior by using an equivalent electrical 
model. Equivalences are shown in Table 5.2. 
Fig. 5.23 Microprocessor with fan: left, thermal model; right, transient 

Table 5.2 Equivalences between electrical and thermal models 

Electrical model               |                             | Thermal model                 | 
Current                        | $I$                         | Thermal flow, “power flow”    | $P_{th} = \dot{Q}$ 
Total charge                   | $Q = \int I \, dt$          | Thermal energy                | $E_{th} = \int P_{th} \, dt$ 
Potential                      | $\phi$                      | Temperature                   | $\theta$ 
Voltage = potential difference | $V = \Delta\phi$            | Temperature difference        | $\Delta\theta$ 
Resistance (a)                 | $R = \rho_{el} * L/A$       | Thermal resistance            | $R_{th} = (1/\kappa) * L/A$ 
Ohm’s law                      | $V = R * I$                 | Δ temperature at $R_{th}$     | $\Delta\theta = R_{th} * P_{th}$ 
Capacitance                    | $C$                         | Thermal capacitance           | $C_{th}$ 
Charge on capacitor            | $Q = C * V$                 | Energy at capacitance         | $E_{th} = C_{th} * \Delta\theta$ 
Capacitance of object (b)      | $C = \rho_q * \mathcal{V}$  | Capacitance of object         | $C_{th} = c_v * \mathcal{V}$ 

(a) $\rho_{el}$ is the specific electrical resistance or volume resistivity 

(b) $\rho_q$ is the volume charge density 
Fig. 5.24 HotSpot model of a chip mounted on a heat spreader and a heat sink 


Well-known techniques for solving electrical network equations (see, e.g., Chen 
et al. [96]) apply. However, there is no component corresponding to inductances 
on the thermal side. This equivalence between thermal and electrical models is 
exploited in tools such as HotSpot [500]. Figure 5.24 shows a HotSpot model of a chip 
mounted on a heat spreader which in turn is mounted on a heat sink [499]. Skadron 
et al. [499] emphasize the fact that large temperature gradients can exist within a 
chip, a heat spreader, or a heat sink. Hence, it is important not to assume a uniform 
temperature for these parts. In Fig. 5.24, the chip is assumed to comprise three 
micro-architectural components with each component forming one thermal zone. 

The heat spreader and the heat sink are modeled as five zones each. One zone of 
the heat spreader is located beneath the chip, and four zones are located on the sides. 
Zones on the sides possess a trapezoid-like shape and are indicated by dotted lines. 
The same partitioning has been done for the heat sink. Zones in the center cannot 
be shown in Fig. 5.24; they are hidden. Otherwise, each of the zones is shown as 
a node in the equivalent network in Fig. 5.24. The ambient temperature is assumed 
to be homogeneous. $R_{convection}$ is the thermal resistance to the environment. It is 
connected to the five zones of the heat sink. $R_{hs}$ is the thermal resistance between the 
heat spreader and the heat sink. The heat spreader is also modeled as five zones. The one 
in the center is connected to the chip via $R_{sp}$. The heat source is actually not shown. 
For each of the zones, there is one thermal capacitance. Each of them models the 
difference in temperatures if compared to the environment. Accordingly, it is always 
considered to be connected to the ground. Furthermore, for each of the zones, there 
is a pair of thermal resistors connecting adjacent zones. 

In their experiments, Skadron et al. have used the Wattch (see p. 262) power
simulator as heat source. Microarchitectural simulators such as SimpleScalar can be
used to drive Wattch. HotSpot contains mechanisms to create a system of partial
differential equations for models such as the one in Fig. 5.24. These equation
systems are then solved using a Runge-Kutta equation solver.
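
The general idea can be illustrated with a minimal sketch. It is not HotSpot's actual implementation: we assume a network with only two thermal zones (chip and heat sink), and all resistance, capacitance, and power values as well as the function names are invented for illustration. Temperatures are expressed as differences to the ambient temperature, and the classical fourth-order Runge-Kutta scheme is used for integration.

```python
def derivatives(temps, power, r_chip_sink, r_convection, c_chip, c_sink):
    """Rates of change (K/s) of the temperature rises of chip and heat sink."""
    t_chip, t_sink = temps
    flow_chip_to_sink = (t_chip - t_sink) / r_chip_sink   # heat flow in W
    flow_sink_to_ambient = t_sink / r_convection          # heat flow in W
    return ((power - flow_chip_to_sink) / c_chip,
            (flow_chip_to_sink - flow_sink_to_ambient) / c_sink)


def rk4_step(temps, dt, *params):
    """One classical fourth-order Runge-Kutta step of size dt."""
    def shifted(base, slope, factor):
        return tuple(b + factor * s for b, s in zip(base, slope))

    k1 = derivatives(temps, *params)
    k2 = derivatives(shifted(temps, k1, dt / 2), *params)
    k3 = derivatives(shifted(temps, k2, dt / 2), *params)
    k4 = derivatives(shifted(temps, k3, dt), *params)
    return tuple(t + dt / 6 * (a + 2 * b + 2 * c + d)
                 for t, a, b, c, d in zip(temps, k1, k2, k3, k4))


def simulate(power=10.0, r_chip_sink=0.5, r_convection=1.0,
             c_chip=0.1, c_sink=10.0, dt=0.01, duration=200.0):
    """Integrate from ambient temperature; all default values are assumptions."""
    temps = (0.0, 0.0)
    for _ in range(round(duration / dt)):
        temps = rk4_step(temps, dt, power, r_chip_sink, r_convection,
                         c_chip, c_sink)
    return temps
```

For the assumed values, the simulated temperature rises settle at the steady state predicted by the thermal equivalent of Ohm's law: the heat sink ends up at P · R_convection above ambient, and the chip at P · (R_chip_sink + R_convection) above ambient.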

Skadron et al. found that it is necessary to consider different thermal zones. 
Furthermore, they found that power consumption has an impact on the temperature, 
but in order to really check whether thermal constraints are met, one needs to model 
temperature explicitly. Several power-saving optimizations had only a small impact 
on crucial temperatures. For example, register files tend to get hot. Saving power on 
memory references is of little help in this context and might even have a negative 
impact. 


Example 5.7 As an example of the results of thermal modeling, we consider 
an MPSoC of STMicroelectronics, comprising 64 P2012 cores [506]. Thermal 
modeling of this MPSoC has been performed with the 3D-ICE [24] tool. Relative 
temperatures for this MPSoC are shown in Fig. 5.25.⁷ High temperatures are shown
in red and low temperatures in blue. 

The MPSoC contains four clusters, each including 16 cores. Each of the corners 
of the layout corresponds to one cluster. The 16 processors are located at the center 
of the clusters. Memories are located below and above the processors. Simulation 
confirms that the processors are hotter than the memories. The higher utilization 
of Fig. 5.25 (bottom) leads to higher temperatures. Detailed modeling of the layout 
avoided temperature overestimation. ∇


Validation of thermal models requires precise temperature measurements [394]. 


5.6 Dependability and Risk Analysis 


Next, we are going to look at dependability and possible risks. 


5.6.1 Aspects of Dependability 


Embedded and cyber-physical systems (like other products) can cause damages to 
properties and lives. The fact that such systems are potentially safety-critical was 
already included in Table 1.2 on p. 18. Hence, in general, we have to take this fact 
into account. It is not possible to reduce the risk of damages to zero. The best that we
can do is to make the probability of damages small, hopefully orders of magnitude
smaller than other risks. Dependability comprises various aspects, most importantly
safety and data security. These, in turn, contain aspects such as reliability and
confidentiality. Designs must be evaluated with respect to these aspects.


⁷ Images are included with permission of David Atienza (EPFL). Images were obtained as part of
the cooperation between EPFL and STMicroelectronics in the FP7 EU Project titled “PRO3D:
Programming for Future 3D Architectures with Many Cores”.

Fig. 5.25 Thermal simulation results for MPSoC: 50% utilization


5.6.2 Security Analysis 


Security of embedded and cyber-physical systems was not seen as a serious issue 
when these systems were not electronically accessible from the outside. This has 
changed for systems which can be accessed through communication channels.
Safety and security are now much more closely related, since security holes can
cause physical malfunctions resulting in accidents.

Security analysis needs to consider attacker models mentioned already in 
Sect. 3.8. This analysis needs to find out if attacks are feasible even without having 
physical access to the embedded system. If the system can be physically accessed, 
physical attacks must be considered as well.

Fig. 5.25 (continued): 50% utilization


Furthermore, relationships between encryption and decryption protocols and 
achievable data rates must be analyzed, since it could easily happen that resource- 
constrained embedded devices do not provide the expected encryption and decryp- 
tion rates. 


5.6.3 Safety Analysis 


Damages should also be avoided, as much as possible, by designing safe systems. 
In practice, at best we can expect to design a system such that the probability of 
damages is orders of magnitude less than the probability of damages from other 
risks. 

Typically, the minimum requirement for manufacturing safety-related products 
is to be ISO 9001 compliant. This standard defines requirements for quality 
management systems in general. Requirements as per this standard include the 
following principles [254]: customer focus, leadership, engagement of people, 
process approach, improvement, evidence-based decision-making, and relationship 
management. The first four principles are more or less self-explanatory. The improvement
principle requires work to proceed in plan, do, check, and act (PDCA) cycles.
The goal of planning includes establishing objectives and addressing risks and
opportunities. The goal of the do phase is to implement the plan. This should be
followed by checking the results and taking actions to improve if necessary.

For the design of safety-related systems, more specific guidelines have been 
developed and published as the IEC 61508 international standard [527]. Part 1 [232] 
of this standard defines standard techniques for technical systems in general. Part 
2 [233] specifies requirements for electrical/electronic/programmable electronic 
safety-related systems. Software requirements are listed in part 3 [234]. Parts 4 to 
6 contain less formal further recommendations. These standards assume that it is 
not feasible to design technical systems which always provide the expected service. 
Emphasis is placed on documented design procedures capable of tracing underlying 
reasons for incorrect decisions. 

In standard IEC 61508, a distinction is made between four different levels of 
risks, called safety integrity levels (SIL). For continuously operating devices, the 
standard specifies failure rates per hour of $10^{-5}$ to $10^{-6}$ for SIL-1, $10^{-6}$ to $10^{-7}$ for
SIL-2, $10^{-7}$ to $10^{-8}$ for SIL-3, and $10^{-8}$ to $10^{-9}$ for SIL-4 [581]. SIL-4 is difficult
to achieve and typically requires redundant execution. Problems arise from the 
current trend toward mixed-criticality, which means that subsystems of different 
SIL-levels are implemented, for example, on the same multi-core processor. Proper 
shielding of the different levels of criticality is difficult. 
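
The failure rate bands listed above can be turned into a small lookup. The following sketch is only an illustration: the function name is hypothetical, and treating the lower bound of each band as inclusive and the upper bound as exclusive is our assumption, since the text does not state how boundary values are assigned.

```python
def sil_for_failure_rate(failures_per_hour):
    """Return the SIL (1..4) whose band contains the given failure rate per hour.

    Returns None if the rate lies outside all four bands. The handling of
    boundary values (lower bound inclusive) is an assumption.
    """
    bands = [
        (4, 1e-9, 1e-8),
        (3, 1e-8, 1e-7),
        (2, 1e-7, 1e-6),
        (1, 1e-6, 1e-5),
    ]
    for level, lower, upper in bands:
        if lower <= failures_per_hour < upper:
            return level
    return None
```

For instance, a continuously operating device with 5 · 10^-7 failures per hour falls into the SIL-2 band, whereas a rate of 10^-3 per hour does not qualify for any level.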

Standard IEC 61508 is expected to apply to several industries. There are specific 
extensions for specific industries. These consider, for example, the amount of time 
which is available for human interventions, the possibility of transitioning into a fail- 
safe mode, and the impact of malfunctions. For example, there is very little time to 
react if something goes wrong in a car. However, cars can usually be stopped and 
parked in a “fail-safe” mode and a safe place (with the exception of some tunnels, 
etc.). In contrast, there is usually some more time available in an airplane, but some 
safety-critical systems in an airplane cannot simply be turned off. 

MISRA-C defines rules to be followed when using the C programming language 
for safety-critical systems [396]. 

ISO 26262 [252] is a standard more tailored for the automotive industry. 

Standards IEC 62279 and CENELEC 50128 take the special situation for rail- 
based transportation into account [60]. 

For avionics, systems should comply with the Airworthiness Certification Spec- 
ifications FAR-CS 25.1309 “Equipment, Systems and Installations” and with AC- 
AMC 25.1309 “System design and analysis” [549]. This is extended for hardware by 
standard DO-254 and for software by standard DO-178B (“Software Considerations 
in Airborne Systems and Equipment Certification”) [163, 474], in Europe also called 
ED-12B. DO-178C is a follow-up standard for DO-178B. 

IEC 61511 [236] has been defined for applications in manufacturing, and IEC 
61513 [235] is a special standard for nuclear power plants. 

Allowed failures may be in the order of 1 failure per $10^9$ hours of operation
or even significantly less for highly safety-critical systems like nuclear power 
plants. This may be several orders of magnitude less than the failure rates of 
chips. Hence, Kopetz [303] stressed that the system as a whole must be more 
dependable than any of its parts and that safety requirements cannot come in as 
an afterthought but must be considered right from the beginning. Obviously, fault-tolerance
mechanisms must be used. Due to the low acceptable failure rate, systems
are not 100% testable. Instead, safety must be shown by a combination of testing 
and reasoning. Abstraction must be used to make the system explainable using a 
hierarchical set of behavioral models. Design faults and human faults must be taken 
into account. 

In order to address these challenges, Kopetz proposed the following 12 design 
principles: 


1. Safety considerations may have to be used as the important part of the specifica- 
tion, driving the entire design process. 

2. Precise specifications of design hypotheses must be made right at the beginning. 
These include expected failures and their probability. 

3. Fault-containment regions (FCRs) must be considered. Faults in one FCR should 
not affect other FCRs. 

4. A consistent notion of time and state must be established. Otherwise, it will be 
impossible to differentiate between original and follow-up errors. 

5. Well-defined interfaces must hide the internals of components. 

6. It must be ensured that components fail independently. 

7. Components should consider themselves to be correct unless two or more other
components claim the contrary (principle of self-confidence).

8. Fault-tolerance mechanisms must be designed such that they do not create any 
additional difficulty in explaining the behavior of the system. Fault-tolerance 
mechanisms should be decoupled from the regular function. 

9. The system must be designed for diagnosis. For example, it has to be possible to 
identify existing (but masked) errors. 

10. The man-machine interface must be intuitive and forgiving. Safety should be 
maintained despite mistakes made by humans. 

11. Every anomaly should be recorded. These anomalies may be unobservable at 
the regular interface level. This recording should involve internal effects, since 
otherwise they may be masked by fault-tolerance mechanisms. 

12. Provide a never-give-up strategy. Embedded systems may have to provide 
uninterrupted service. The generation of pop-up windows or going off line is 
unacceptable. 


Definition 5.31 A system is resilient if internal or external changes of the
assumptions made at design time will change the overall user experience only in 
a limited way. 


A system which is self-repairing would provide some level of resiliency. 
Resiliency is beyond the scope of this book. 


5.6.4 Reliability Analysis 


The design of dependable systems also requires an analysis of the reliability 
(the likelihood of initially correctly designed systems not to malfunction due to
some internal fault). This task is expected to become more important and more
difficult in the future, since decreasing feature sizes of semiconductors will
result in a reduced reliability of semiconductor devices (see, e.g.,
http://variability.org). Transient as well as permanent faults are expected to become more
frequent. Shrinking feature sizes will also cause an increased variability among 
device parameters. Therefore, dependability analysis and fault-tolerant designs are 
becoming extremely important [179, 406]. Faults within semiconductors might lead 
to failures of the system. The terms faults, failures, and the related terms error and 
service were defined by Laprie et al. [29, 323]. 


Definition 5.32 “The service delivered by a system (in its role as a provider) is its 
behavior as it is perceived by its user(s); ... The delivered service is a sequence 
of the provider’s external states. ... Correct service is delivered when the service 
implements the system function.” 


Definition 5.33 “A service failure, often abbreviated here to failure, is an event 
that occurs when the delivered service of a system deviates from the correct service. 
...A service failure is a transition from correct service to incorrect service.” 


Definition 5.34 An error exists if one of the system’s states is incorrect and may 
lead to its subsequent service failure. 


Definition 5.35 “The adjudged or hypothesized cause of an error is called a fault. 
Faults can be internal or external of a system.” 


Some faults will not cause a system failure. 

As an example, we might consider a transient fault flipping a bit in memory. 
After this bit flip, the memory cell will be in error. A failure will occur if the system 
service is affected by this error. 

In line with these definitions, we will talk about failure rates when we consider 
systems that do not provide the expected system function. We will talk about faults 
whenever we consider the underlying reasons that might cause failures. There are 
a large number of possible reasons for faults, some of them resulting from reduced 
feature sizes of semiconductors. Errors will not be considered in the remaining part 
of this book. 

Reaching a level of dependability corresponding to SIL-4 is only feasible if 
design evaluation also comprises the analysis of the reliability, the expected lifetime, 
and related objectives. Such an analysis is usually based on the probability of 
failures. 

More precisely, we consider the probability densities of failures. Let x be the time 
until the first failure. x is a random variable. Let f(x) be the probability density of 
this random variable. 

As an example, we are frequently using the exponential probability density
$f(x) = \lambda e^{-\lambda x}$. For this density function, failures are becoming less and less likely
over time (after some time, it is likely that the system is not working anymore and
a system which is not working cannot fail). This density function is frequently used
since it corresponds to a constant failure rate. We might even use this density function when
the actual failure rate is unknown since a constant failure rate may be a good starting
point. Moreover, this density function has nice mathematical properties. Figure 5.26
(left) shows this density function.

Fig. 5.26 Exponential distribution: left, density function; right, probability distribution

The probability distribution is frequently more interesting than the density. This 
distribution represents the probability of a system not working at time t. It can be 
obtained by integrating the density function until time t. 


$$F(t) = \Pr(x \le t) \tag{5.57}$$

$$F(t) = \int_0^t f(x)\,dx \tag{5.58}$$


For example, for the exponential distribution, we obtain:

$$F(t) = \int_0^t \lambda e^{-\lambda x}\,dx = -\left[e^{-\lambda x}\right]_0^t = 1 - e^{-\lambda t} \tag{5.59}$$


Figure 5.26 (right) contains the corresponding function. As time advances, this 
probability approaches 1. This means that, as time progresses, it becomes more 
likely that the system will have failed. 
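
As a quick plausibility check of Eqs. (5.58) and (5.59), the closed form can be compared with a numerical integration of the density. The sketch below assumes the exponential density; the function names, the midpoint rule, and the number of integration steps are our own choices for illustration.

```python
import math


def density(x, rate):
    """Exponential probability density with constant failure rate `rate`."""
    return rate * math.exp(-rate * x)


def failure_probability(t, rate):
    """Closed form of Eq. (5.59): probability of a failure up to time t."""
    return 1.0 - math.exp(-rate * t)


def failure_probability_numeric(t, rate, steps=10_000):
    """Midpoint-rule approximation of the integral in Eq. (5.58)."""
    width = t / steps
    return sum(density((i + 0.5) * width, rate) for i in range(steps)) * width
```

Both functions agree closely. For example, at the time t = 1/λ, the probability that the system has already failed is 1 − 1/e, i.e., about 63%.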


Definition 5.36 The reliability R(t) of a system is the probability of the time until 
the first failure being larger than t: 


$$R(t) = \Pr(x > t),\quad t \ge 0 \tag{5.60}$$

$$R(t) = \int_t^\infty f(x)\,dx \tag{5.61}$$

$$F(t) + R(t) = \int_0^t f(x)\,dx + \int_t^\infty f(x)\,dx = 1 \tag{5.62}$$

$$R(t) = 1 - F(t) \tag{5.63}$$

$$f(x) = -\frac{dR(x)}{dx} \tag{5.64}$$


For the exponential distribution, we have $R(t) = e^{-\lambda t}$ (see Fig. 5.27).

Fig. 5.27 Reliability for exponential distribution

The probability for the system to be functional after time $t = 1/\lambda$ is about 37%.


Definition 5.37 The failure rate $\lambda(t)$ is the probability of a system failing between
time $t$ and time $t + \Delta t$, provided that it was working at time $t$, taken per unit of time
in the limit $\Delta t \to 0$:


$$\lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le x \le t + \Delta t \mid x > t)}{\Delta t} \tag{5.65}$$


$\Pr(t \le x \le t + \Delta t \mid x > t)$ is the conditional probability for the system failing
within this time interval provided that it was working at time $t$. For conditional
probabilities, there is the general equation $\Pr(A|B) = \Pr(AB)/\Pr(B)$, where
$\Pr(AB)$ is the probability of A and B happening. $\Pr(AB)$ is equal to $F(t + \Delta t) - F(t)$
in our case. $\Pr(B)$ is the probability of the system working at time $t$, which is
$R(t)$ in our notation. Therefore, Eq. (5.65) leads to:


$$\lambda(t) = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t \cdot R(t)} = \frac{f(t)}{R(t)} \tag{5.66}$$

For example, for the exponential distribution, we obtain:⁸

$$\lambda(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda \tag{5.67}$$


Failure rates are frequently measured as multiples (or fractions) of 1 FIT, where 
“FIT” stands for Failure unIT and is also known as Failures In Time. 1 FIT 
corresponds to 1 failure per $10^9$ hours.
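
Working with FIT numbers mostly amounts to unit conversion. The following sketch shows the conversion and a simple fleet estimate; the function names are hypothetical, and the fleet estimate assumes a constant failure rate and that failures are rare enough for the number of operating devices to stay essentially unchanged.

```python
HOURS_PER_FIT_REFERENCE = 1e9  # 1 FIT = 1 failure per 10^9 hours


def fit_to_failures_per_hour(fit):
    """Convert a failure rate given in FIT into failures per hour."""
    return fit / HOURS_PER_FIT_REFERENCE


def expected_failures(fit, devices, hours_each):
    """Expected number of failures in a fleet (assumes a constant failure rate)."""
    return fit_to_failures_per_hour(fit) * devices * hours_each
```

For example, a component rated at 100 FIT has a failure rate of 10^-7 per hour. Under the stated assumptions, 1000 such components operated for 10,000 hours each are expected to show about one failure in total.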

However, failure rates of real systems are frequently not constant. For many 
systems, we have a “bath tub curve”-like behavior (see Fig. 5.28).

For this behavior, we are starting with an initially larger failure rate. This higher 
rate is a result of an imperfect production process or “infant mortality.” The rate 
during the normal operating life is then essentially constant. At the end of the useful 
product life, the rate is then increasing again, due to wear-out. 


⁸ This result motivates denoting the failure rate and the constant of the exponential distribution with
the same symbol.

Fig. 5.28 Bath tub curve-like failure rates (plot of $\lambda(t)$ over time $t$ with 1st, 2nd, and 3rd phase)


Definition 5.38 The mean time to failure (MTTF) is the average time until the 
next failure, provided that the system was initially working. This average can be 
computed as the expected value of random variable x: 


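$$\mathit{MTTF} = E\{x\} = \int_0^\infty x f(x)\,dx \tag{5.68}$$


For example, for the exponential distribution, we obtain:


$$\mathit{MTTF} = \int_0^\infty x \lambda e^{-\lambda x}\,dx \tag{5.69}$$


This integral can be computed using integration by parts, which follows from the
product rule ($\int u v' = uv - \int u' v$, where in our case we have $u = x$ and
$v' = \lambda e^{-\lambda x}$, so that $v = -e^{-\lambda x}$). Therefore, Eq. (5.69) leads to the
following equation:


$$\mathit{MTTF} = -\left[x e^{-\lambda x}\right]_0^\infty + \int_0^\infty e^{-\lambda x}\,dx \tag{5.70}$$

$$= -\frac{1}{\lambda}\left[e^{-\lambda x}\right]_0^\infty = -\frac{1}{\lambda}[0 - 1] = \frac{1}{\lambda} \tag{5.71}$$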


This means that, for the exponential distribution, the expected time until the next 
failure is the reciprocal value of the failure rate. 
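
This result can also be checked numerically. The sketch below approximates the integral of Eq. (5.68) for the exponential density with the midpoint rule; the function name, the truncation of the infinite integration range at 50/λ, and the number of steps are assumptions made for illustration.

```python
import math


def mttf_numeric(rate, upper_multiple=50.0, steps=200_000):
    """Approximate the expected time to failure for the exponential density.

    The infinite upper limit is replaced by upper_multiple / rate; the
    neglected tail is tiny because the integrand decays exponentially.
    """
    upper = upper_multiple / rate
    width = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width                    # midpoint of the i-th interval
        total += x * rate * math.exp(-rate * x)  # x * f(x)
    return total * width
```

For a failure rate of 10^-4 per hour, the numerical result is very close to 1/λ = 10,000 hours, in line with Eq. (5.71).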

There is the following empirical relationship between MTTF and operating 
temperatures: 


Lemma 5.2 (Black’s equation [49, 55])


$$\mathit{MTTF} = \frac{A}{j_e^{\,n}}\, e^{\frac{E_a}{k \theta}} \tag{5.72}$$

where
$A$ : constant
$j_e$ : current density
$n$ : constant (1..7), controversial, 2 according to Black
$E_a$ : activation energy (e.g., ≈ 0.6 eV)
$k$ : Boltzmann constant (≈ $8.617 \times 10^{-5}$ eV/K)
$\theta$ : temperature

Fig. 5.29 Illustration of MTTF, MTTR, and MTBF


Regardless of discussions about the correct value of n, this equation shows that 
the temperature has an exponential impact on the MTTF. Furthermore, current 
densities are also important: the larger the current densities, the shorter the lifetime 
of the product. 
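
To get a feeling for the size of these effects, Black's equation can be evaluated for two operating points. The sketch below is only an illustration: the function names are hypothetical, the constant A is set to 1 because it cancels in ratios, temperatures are assumed to be absolute temperatures in kelvin, and the values E_a = 0.6 eV and n = 2 are the example values mentioned above.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def mttf_black(current_density, temperature_kelvin,
               a_const=1.0, n=2.0, activation_energy_ev=0.6):
    """Evaluate Black's equation; a_const = 1 is an arbitrary placeholder."""
    exponent = activation_energy_ev / (BOLTZMANN_EV_PER_K * temperature_kelvin)
    return a_const / current_density ** n * math.exp(exponent)


def lifetime_ratio(temp_cool_kelvin, temp_hot_kelvin, activation_energy_ev=0.6):
    """MTTF at the cooler temperature divided by MTTF at the hotter one.

    The constant A and the current density cancel if both are unchanged.
    """
    return math.exp(activation_energy_ev / BOLTZMANN_EV_PER_K
                    * (1.0 / temp_cool_kelvin - 1.0 / temp_hot_kelvin))
```

With these assumptions, raising the temperature from 350 K to 360 K shortens the MTTF by a factor of about 1.7, and doubling the current density at n = 2 reduces the MTTF to one quarter.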


Definition 5.39 The mean time to repair (MTTR) is the average time to repair a 
system, provided that the system is initially not working. This time is the expected 
value of the random variable denoting the time to repair. 


Definition 5.40 The mean time between failures (MTBF) is the average time 
between two failures. 


MTBF is the sum of MTTF and MTTR: 
MTBF = MTTF + MTTR (5.73) 


Figure 5.29 shows a simplistic view of this equation: it is not reflecting the fact 
that we are dealing with probabilistic events, and actual MTBF, MTTF, and MTTR 
values may vary randomly. For many systems, repairs are not considered. Also, if 
they are considered, the MTTR should be much smaller than the MTTF. Therefore, 
the terms MTBF and MTTF are frequently mixed up. For example, the lifetime 
of a hard disk may be quoted as a certain MTBF, even though it will never be 
repaired. Quoting this number as the MTTF would be more correct. Still, the MTTF 
provides only very rough information about dependability, especially if there are 
large variations in the failure rates over time. 


Definition 5.41 The availability is the probability of a system being in an opera- 
tional state. 


The availability varies over time (just consider the bath tub curve!). Therefore, 
we can model availability by a time-dependent function A(t). However, we are 
frequently only considering the availability A for large time intervals. Hence, we
define

$$A = \lim_{t \to \infty} A(t) = \frac{\mathit{MTTF}}{\mathit{MTBF}} \tag{5.74}$$

Fig. 5.30 Failure rates of TriQuint’s gallium arsenide devices (courtesy of TriQuint, Inc., Hillsboro), ©TriQuint

For example, assume that we have a system which is repeatedly available for 999 
days and then needs 1 day for repair. Such a system would have an availability of 
A = 0.999. 
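
The same computation can be written down in a few lines. In this sketch, the function names are hypothetical, MTBF is taken as the sum of MTTF and MTTR as in Eq. (5.73), and a year of 8760 hours is assumed for the downtime estimate.

```python
def availability(mttf, mttr):
    """Long-term availability: MTTF divided by MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)


def downtime_hours_per_year(avail, hours_per_year=8760.0):
    """Expected downtime per year for a given long-term availability."""
    return (1.0 - avail) * hours_per_year
```

For the example above (999 days of operation followed by 1 day of repair), the function returns 0.999, which corresponds to an expected downtime of roughly 8.8 hours per year.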

Allowed failure rates can be in the order of 1 FIT. This may be several orders 
of magnitude less than the failure rates of chips. This means that systems must 
be more reliable than their components! Obviously, the required level of reliability 
makes fault-tolerance techniques a must! 

Obtaining actual failure rates is difficult. Figure 5.30 shows one of the few 
published results [546]. This figure contains failure rates for different gallium 
arsenide (GaAs) devices with the hottest transistor operating at a temperature of 
150°C. 

This example is used here to demonstrate that there exist devices for which the
assumptions of constant failure rates or a bath tub-like behavior are oversimplifying.⁹
As a result, citing a single MTTF number may be misleading. The actual
distribution of failures over time should be used instead. In the particular case of
this example, failure rates are less than 100 FIT for the first 20 years (175,300 h) of
product lifetime, despite the high temperature. FIT numbers are actually very much
temperature dependent, and temperatures up to 275°C and known temperature
dependences have been used at TriQuint to compute failure rates for periods longer
than the time available for testing. TriQuint claims that their GaAs devices are more
reliable than average silicon devices. Reports on FIT testing are also available for
Xilinx FPGAs (see, e.g., [600]).


⁹ Therefore, the so-called log-normal distribution is sometimes considered.

Fig. 5.31 Fault tree


5.6.5 Fault Tree Analysis, Failure Mode, and Effect Analysis 


It is frequently not possible to experimentally verify failure rates of complete 
systems. Requested failure rates are too small, and failures may be unacceptable. 
We cannot fly $10^5$ airplanes $10^4$ hours each in an attempt to check if we reach a
failure rate of less than $10^{-9}$ (SIL-4)! The only way out of this dilemma is to use
a combination of checking failure rates of components and formally deriving from 
this guarantees for a reliable operation of the system. Design- and user-generated 
failures also must be taken into account. It is state of the art to use decision diagrams 
to compute the reliability of a system from that of its components [260]. 

Damages are resulting from hazards (chances for a failure). For each possible 
damage caused by a failure, there is a severity (the cost) and a probability. Risk can 
be defined as the product of the two. Information concerning the damages resulting 
from component failures can be derived with at least two techniques [143, 459]: 


• Fault tree analysis (FTA): FTA is a top-down method of analyzing risks. The
analysis starts with a possible damage and then tries to come up with possible 
scenarios that lead to that damage. FTA is based on modeling a Boolean function 
reflecting the operational state of the system (operational or not operational). 
FTA typically includes symbols for AND- and OR-gates, representing conditions 
for possible damages. OR-gates are used if a single event could result in a 
hazard. AND-gates are used when several events or conditions are required 
for that hazard to exist. Figure 5.31 shows an example.¹⁰ FTA is based on a
structural model of the system, i.e., it reflects the partitioning of the system into 
components. 


¹⁰ Consistent with the ANSI/IEEE standard 91, we use the symbols &, =1, and ≥1 to denote AND-,
XOR-, and OR-gates, respectively.

Table 5.3 FMEA table

Component | Failure | Consequences | Probability | Critical?
Processor | Metal migration | No service | $10^{-7}$/h | Yes


The simple AND- and OR-gates cannot model all situations. For example, 
their modeling power is exceeded if shared resources of some limited amount 
(like energy or storage locations) exist. Markov models [67] may have to be used 
to cover such cases. Markov models are based on the notion of states, rather than 
on the structure of the system. 

• Failure mode and effect analysis (FMEA): FMEA starts at the components
and tries to estimate their reliability. Using this information, the reliability of the 
system is computed from the reliability of its parts (corresponding to a bottom- 
up analysis). The first step is to create a table containing components, possible 
failures, probability of failures, and consequences on the system behavior. Risks 
for the system as a whole are then computed from the table. Table 5.3 shows an 
example. 
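
The evaluation of a fault tree built from AND- and OR-gates can be sketched as follows. This is only an illustration: the tree shown here is hypothetical and does not reproduce Fig. 5.31, the event names and probabilities are invented, and all basic events are assumed to be statistically independent (an assumption that must be justified in a real analysis).

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if all (independent) input events occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one (independent) input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def top_event_probability(p_power, p_sensor_a, p_sensor_b):
    """Hypothetical tree: damage if power fails OR both redundant sensors fail."""
    both_sensors_fail = and_gate([p_sensor_a, p_sensor_b])
    return or_gate([p_power, both_sensors_fail])
```

With a probability of 10^-4 for the power failure and 10^-2 for each sensor failure, the AND-gate yields 10^-4 for the loss of both sensors, and the top event has a probability of slightly less than 2 · 10^-4. The example also shows why redundancy helps: a single sensor alone would dominate the result with 10^-2.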


Tools supporting both approaches are available. Both approaches may be used 
in “safety cases”. In such cases, an independent authority has to be convinced 
that certain technical equipment is indeed safe. One of the commonly requested 
properties of technical systems is that no single failing component should potentially 
cause a catastrophe. 

The design of safe and dependable systems is a topic in its own right. This book can
only provide a few hints in this direction. There is an abundant amount of recent
publications on the impact of reliability issues on system design. Examples include 
publications by Huang [223], Zhuo [613], and Pan [445]. For more information 
about dependability, consult books [181, 323, 339, 418, 513] on those areas. 


5.7 Simulation 


In this chapter, we have so far placed an emphasis on design evaluation. Starting 
with this section, we are now also considering validation. Simulation is a very 
common technique for evaluating and validating designs. Simulation consists of 
executing a design model on appropriate computing hardware, typically on general- 
purpose digital computers. Obviously, this requires models to be executable. All the 
executable models and languages introduced in Chap. 2 can be used in simulations, 
and they can be used at various levels as described starting at p. 115. The level at 
which designs are simulated is always a compromise between simulation speed and 
accuracy. The faster the simulation, the less accuracy is available.

So far, we have used the term behavior in the sense of the functional behavior
of systems (their input/output behavior). There are also simulations of some 
non-functional behaviors of designs, including the thermal behavior and the elec- 
tromagnetic compatibility (EMC) with other electronic equipment. Due to the 
integration with physics, there is a large range of physical effects which may have 
to be included in the simulation model. As a result, it is impossible to cover all 
relevant approaches for simulating cyber-physical systems in this book. Law [325] 
provides an overview of approaches and topics in simulations on digital systems. A 
large amount of additional information on the simulation of systems (in particular 
of heterogeneous, cyber-physical systems) is available (see, e.g., [126, 362, 442]). 
Some simulators specialize in specific application areas. Due to the large number
of physical effects, it is impossible to provide a complete list of references. 

For cyber-physical systems, simulations have serious limitations: 


• Simulations are typically a lot slower than the actual design. Hence, if we
interface the simulator with the actual environment, we can have quite a number
of violations of timing constraints.

• Simulations in the physical environment may even be dangerous (who would
want to drive a car with unstable control software?).

• For many applications, there may be huge amounts of data, and it may be
impossible to simulate enough data in the available time. Multimedia applications
are notorious for this. For example, simulating the compression of some
video stream takes an enormous amount of time.

• Most actual systems are too complex to allow simulating all possible cases
(inputs). Hence, simulations can help us to find errors in our designs. They cannot
guarantee absence of errors, since simulations cannot exhaustively be done for all
possible combinations of inputs and internal states.


Due to these limitations, there is an increased emphasis on validation by formal 
verification (see p. 290). Nevertheless, sophisticated simulation techniques continue 
to play a key role for validation (see, e.g., Braun et al. [66]). Academic solutions 
like gem5 (see http://gem5.org), SimpleScalar, and OpenModelica as well as 
commercial solutions like the Synopsys® Virtualizer™ (see http://synopsys.com) 
are available. There are several tools for the simulation of networks (as required for 
the Internet of Things), including OMNET++ (see https://omnetpp.org/). 


5.8 Rapid Prototyping and Emulation 


Simulations are based on models, which are approximations of real systems. In 
general, there will be some difference between the real system and the model. We 
can reduce the gap by implementing some parts of our system under design (SUD) 
more precisely than in a simulator (e.g., in a real, physical component). 




Definition 5.42 Adopting a definition phrased by McGregor [383], we define
emulation as the process of executing a model of the SUD where at least one 
component is not represented by simulation on some kind of host computer. 


According to McGregor, “Bridging the credibility gap is not the only reason
for a growing interest in emulation — the above definition of an emulation model 
remains valid when turned around — an emulation model is one where part of the 
real system is replaced by a model. Using emulation models to test control systems 
under realistic conditions, by replacing the ...(real system) ...with a model, is 
proving to be of considerable interest to those responsible for commissioning, or 
the installation and start-up of automated systems of many kinds.” 

In order to further improve credibility, we can continue replacing simulated 
components by real components. These components do not have to be the final 
components. They can be approximations of the real system itself but should exceed 
the precision of simulations. 

Note that it is now common to discuss the “emulation” of one computer on 
another computer by means of software. There is a lack of a precise definition of 
the use of the term in this context. However, it can be considered consistent with our 
definition, since the emulated computer is not just simulated. Rather, a speed faster 
than simulation speed is expected. 


Definition 5.43 Fast prototyping is the process of executing a model of the SUD 
where no component is represented by simulation on some kind of host computer. 
Rather, all components are represented by realistic components. Some of these 
components should not yet be the finally used components (otherwise, this would 
be the real system). 


There are many cases in which the designs should be tried out in realistic 
environments before final versions are manufactured. Control systems in cars are 
an excellent example for this. Such systems should be used by drivers in different 
environments before mass production is started. Accordingly, the automotive 
industry designs prototypes. These prototypes should essentially behave like the 
final systems, but they may be larger, consume more power, and have other
properties which test drivers can accept. The term “prototype” can be associated 
with the entire system, comprising electrical and mechanical components. However, 
the distinction between rapid prototyping and emulation is also blurring. Rapid 
prototyping is by itself a wide area which cannot be comprehensively covered in 
this book. 

Prototypes and emulators can be built, for example, using FPGAs. Racks 
containing FPGAs can be stored in the trunk while test drivers exercise the car. This 
approach is not limited to the automotive industry. There are several other fields in 
which prototypes are built from FPGAs. Commercially available emulators consist 
of a large number of FPGAs. They come with the required mapping tools which map 
specifications to these emulators. Using these emulators, experiments with systems 
which behave “almost” like the final systems can be run. However, catching errors 
by prototyping and emulation is already a problem for non-distributed systems. For
distributed systems, the situation is even more difficult (see, e.g., Tsai [547]). 


5.9 Formal Verification 


Formal verification¹¹ is concerned with formally proving a system correct, using
the language of mathematics. First of all, a formal model is required to make formal 
verification applicable. This step can hardly be automated and may require some 
effort. Once the model is available, we can try to prove certain properties. 

Formal verification techniques can be classified by the type of logic employed: 


• Propositional logic: In this case, models consist of Boolean expressions. Tools
are called Boolean checkers, tautology checkers, or equivalence checkers. 
They can be used to verify that two representations of Boolean functions (or sets 
of Boolean functions) are equivalent. Since propositional logic is decidable, it is 
also decidable whether or not the two representations are equivalent (there will 
be no cases of doubt). For example, one representation might correspond to gates 
of an actual circuit and the other to its specification. Proving the equivalence then 
proves the effect of all design transformations (e.g., optimizations for power or 
delay) to be correct. Boolean checkers can cope with designs which are too large 
to allow simulation-based exhaustive validation. The key reason for the power 
of Boolean checkers is the use of binary decision diagrams (BDDs) [571]. The 
complexity of equivalence checks of Boolean functions represented with BDDs 
is linear in the number of BDD nodes. The number of BDD nodes can potentially 
grow exponentially with the number of variables, but, in practice, many relevant 
functions can be represented with compact BDDs.¹² In contrast, the equivalence
check for functions represented by sums of products is NP-hard. BDD-based 
equivalence checkers have therefore replaced simulators for this application and 
handle circuits with millions of transistors. 

• First-order logic (FOL): FOL adds $\exists$ and $\forall$ quantifiers to propositional logic.
Some automation for verifying FOL models is feasible. However, since FOL is 
undecidable, there may be cases of doubt. Popular techniques include the Hoare 
calculus. Typically, operations on integers are also supported. 

• Higher-order logic (HOL): Higher-order logic is based on lambda calculus and
allows functions to be manipulated like other objects [423]. For higher-order 
logic, proofs can hardly ever be automated and typically must be done manually 
with some proof support. 
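
As a small illustration of equivalence checking in propositional logic, the following Python sketch compares two representations of a Boolean function by enumerating all input assignments. This brute-force approach is feasible only for a handful of variables and is not how BDD-based checkers work; the functions `spec`, `impl`, and `equivalent` are made up for this example.

```python
from itertools import product

def spec(a, b, c):
    # Specification: majority function of three inputs.
    return (a and b) or (a and c) or (b and c)

def impl(a, b, c):
    # Hypothetical optimized gate-level version of the same function.
    return (a and (b or c)) or (b and c)

def equivalent(f, g, num_inputs):
    """Return (True, None) if f and g agree on all inputs,
    otherwise (False, first differing input assignment)."""
    for assignment in product([False, True], repeat=num_inputs):
        if bool(f(*assignment)) != bool(g(*assignment)):
            return False, assignment
    return True, None

result, counterexample = equivalent(spec, impl, 3)
```

Because propositional logic is decidable, such a check always terminates with a definite answer; the price of the naive enumeration is its exponential run time in the number of inputs.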


¹¹ This initial text on formal verification was based on a guest lecture given by Tiziana Margaria at
TU Dortmund. 


¹² Multiplication is a prominent exception [284].

Propositional logic can be used to verify stateless logic networks but cannot
directly model finite state machines. For short input sequences, it may be sufficient 
to cut the feedback loop in FSMs and to effectively deal with several copies of 
these FSMs, each copy representing the effect of one input pattern. However, this 
method does not work for longer input sequences. Such sequences can be handled 
with model checking. 

For model checking, we have two inputs to the verification tool: 


1. The model to be verified 
2. Properties to be verified 


States can be quantified with $\exists$ and $\forall$; numbers cannot. Verification tools can
prove or disprove the properties. In the latter case, they can provide a counterexample.
Model checking is easier to automate than FOL. It was implemented for
the first time in 1987, using BDDs. It was possible to locate several errors in the
specification of the future bus protocol [104]. UPPAAL is a very popular tool for
model checking.¹³

This technique could be used, for example, to prove properties of the railway 
model of Fig. 2.52 (see p. 82). It should be possible to convert the Petri net into a 
state chart and then confirm that the number of trains commuting between Cologne 
and Paris is indeed constant, confirming our discussion of Petri net place invariants 
on p. 81. 
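
The kind of check described above can be sketched with an explicit-state search. The following Python code explores all reachable markings of a small, hypothetical Petri net (two stations connected by two tracks; it is not the model of Fig. 2.52) and checks that the total number of tokens stays constant. Real model checkers such as UPPAAL use far more elaborate state representations; all names used here are assumptions made for illustration.

```python
from collections import deque

# Places: 0 = station A, 1 = track A->B, 2 = station B, 3 = track B->A.
# Each transition is (tokens consumed per place, tokens produced per place).
TRANSITIONS = [
    ((1, 0, 0, 0), (0, 1, 0, 0)),  # leave A
    ((0, 1, 0, 0), (0, 0, 1, 0)),  # arrive at B
    ((0, 0, 1, 0), (0, 0, 0, 1)),  # leave B
    ((0, 0, 0, 1), (1, 0, 0, 0)),  # arrive at A
]

def successors(marking):
    """Yield all markings reachable by firing one enabled transition."""
    for consume, produce in TRANSITIONS:
        if all(m >= c for m, c in zip(marking, consume)):
            yield tuple(m - c + p for m, c, p in zip(marking, consume, produce))

def check_invariant(initial, invariant):
    """Breadth-first search over reachable markings.
    Returns (True, None) or (False, shortest path to a violating marking)."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        marking = queue.popleft()
        if not invariant(marking):
            path = []
            while marking is not None:
                path.append(marking)
                marking = parent[marking]
            return False, path[::-1]
        for nxt in successors(marking):
            if nxt not in parent:
                parent[nxt] = marking
                queue.append(nxt)
    return True, None

initial = (2, 0, 0, 0)  # two trains waiting at station A (assumed)
holds, trace = check_invariant(initial, lambda m: sum(m) == 2)
```

If a property does not hold, the returned path plays the role of the counterexample mentioned above. For instance, checking the property "station B never holds two trains" with `lambda m: m[2] < 2` yields a sequence of markings leading to the violation.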


5.10 Problems 


We suggest solving the following problems either at home or during a flipped 
classroom session: 


5.1 Let us consider an example demonstrating the concept of Pareto optimality. 
In this example, we study the results generated by task concurrency management 
(TCM) tools designed at the IMEC research center (Interuniversitair
Micro-Electronica Centrum). TCM tools aim at establishing efficient mappings from
applications to processors. Different multiprocessor systems are evaluated and 
represented as sets of Pareto optimal designs. Wong et al. [595] describe different 
options for the design of an MPEG-4-player. The authors assume that a combination 
of StrongARM processors and specialized accelerators should be used. Four designs 
meet the timing constraint of 30 ms (see Table 5.4). These different designs are 
shown in Fig. 5.32. For combinations 1 and 4, the authors report that only one
mapping of tasks to processors meets the timing constraints. For combinations 2 and 
3, different time budgets lead to different task to processor mappings and different 
energy consumptions. 


¹³ See http://www.uppaal.org for the academic and http://www.uppaal.com for the commercial
version.
Table 5.4 Processor configurations

Processor combination           | 1 | 2 | 3 | 4
Number of high-speed processors | 6 | 5 | 4 | 3
Number of low-speed processors  | 0 | 3 | 5 | 7
Total number of processors      | 6 | 8 | 9 | 10

Fig. 5.32 Pareto points for multiprocessor systems 2 and 3

Fig. 5.33 Abstract cache states


Which area in the objective space is dominated by at least one design of 
configuration 3? Is there any design belonging to configuration 2 which is not 
dominated by at least one design of configuration 3? Which area in the objective 
space dominates at least one design of configuration 3? 


5.2 Which conditions must be met by computations of $WCET_{EST}$?


5.3 Let us consider cache states at a control flow join. Figure 5.33 shows abstract 
cache states before the join. 

Now let us look at abstract cache states after the join. Which state would a must- 
analysis derive? Which state would a may-analysis derive? 


5.4 Consider an incoming “bursty” event stream. The stream is periodic with a
period of $T$. At the beginning of each period, two events arrive with a separation
of $d$ time units. Develop arrival curves for this stream! Resulting graphs should
display times from 0 up to $3 \cdot T$.


5.5 Suppose that you are working with a processor having a maximum performance 
of b. 
1. What do the service curves look like if the performance can deteriorate to b’, due
to cache conflicts? 

2. How do the service curves change if some timer is interrupting the executed 
program every 100 ms and if servicing the interrupt takes 10 ms? Assume that 
there are no cache conflicts. 

3. What do the service curves look like if you consider cache conflicts like in (1.)
and interrupts like in (2.)? 


Resulting graphs should display times from 0 up to 300 ms. 


5.6 Suppose that we try to collect amber. However, there is the risk of also 
collecting white phosphorus. Suppose that we collect 50 objects. We keep all of 
them in water to avoid fire hazards. We classify 30 objects as amber and 20 as 
white phosphorus. However, two of the objects classified as amber are actually 
pieces of white phosphorus and 8 objects classified as white phosphorus actually
consist of amber. Compute the precision, recall, accuracy, and specificity for this
classification! 


5.7 Suppose that you try to compute the power consumption of your mobile phone 
using a shunt resistor. The following values are relevant for the computation of the 
power consumption at some time t: resistor, 0.47 Ω; power supply voltage, 5.1 V;
and voltage across shunt, 0.23 V. What is the power consumption of your mobile at 
this time t? 


5.8 Consider a copper plate of area A = 10 cm² and length 5 mm. How much thermal
power is transferred if the difference between the temperatures at the two ends of 
the plate is 10°C? 


5.9 Consider a hard disk drive for which we assume that half of the drives have 
failed after 5000h of operation. Let us assume that failures follow an exponential 
distribution. Compute the corresponding value of $\lambda$!


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
Chapter 6
Application Mapping


Mapping of applications onto available hardware platforms is a key design step. 
We need to map applications both to processors and to particular execution times. 
This is feasible with appropriate scheduling techniques. Taking as many scheduling 
decisions as reasonable at design time enables us to provide timing guarantees. 
In this chapter, we will present a selected subset of the corresponding static 
scheduling techniques. They will be classified according to the triplet notation 
proposed by Pinedo and others. First of all, we will explain classical scheduling 
algorithms for single processors. We will cover algorithms for aperiodic as well as 
for periodic task systems, including the well-known earliest deadline first (EDF) 
and rate monotonic scheduling (RMS) algorithms. We will briefly explain the use 
of bin packing algorithms for homogeneous multiprocessor systems. This will be 
followed by a presentation of selected scheduling algorithms for heterogeneous 
multiprocessors. We will be presenting algorithms for independent and dependent 
jobs. For dependent jobs, the focus is on heuristics. Finally, we will be pointing 
toward issues in using dynamic scheduling. 


6.1 Definition of Scheduling Problems 


6.1.1 Elaboration on the Design Problem 


The mentioned mapping to execution platforms is included in the simplified design 
flow, as shown in Fig. 6.1. 

Fig. 6.1 Simplified design information flow

Selected scheduling algorithms should allow us to use systems with a certain
combination of applications. For example, for a mobile phone, we expect to be able
to make a phone call while the Bluetooth stack is transmitting the audio signals to
a headset and while we are looking up information in our “personal information
manager” (PIM). At the same time, there may be a concurrent file transfer or even
a video connection. We must make sure that these applications can be used together
and that we are keeping the deadlines (no lost audio samples!). This is feasible
through an analysis of the use cases.

It is a characteristic of embedded and cyber-physical systems that both hardware 
and software must be considered during their design. Therefore, this type of design 
is also called hardware/software codesign. The overall goal is to find the right 
combination of hardware and software resulting in the most efficient product 
meeting the specification. Therefore, embedded systems cannot be designed by 
a synthesis process taking only the behavioral specification into account. Rather, 
available components must be accounted for. There are also other reasons for this 
constraint: in order to cope with the increasing complexity of embedded systems 
and their stringent time-to-market requirements, reuse is essentially unavoidable. 
This led to the term platform-based design: 

“A platform is a family of architectures satisfying a set of constraints imposed 
to allow the reuse of hardware and software components. However, a hardware 
platform is not enough. Quick, reliable, derivative design requires using a platform 
application programming interface (API) to extend the platform toward application 
software. In general, a platform is an abstraction layer that covers many possi- 
ble refinements to a lower level. Platform-based design is a meet-in-the-middle 
approach: in the top-down design flow, designers map an instance of the upper 
platform to an instance of the lower, and propagate design constraints” [476]. 

The mapping is an iterative process in which performance evaluation tools guide 
the next assignment. 

In this book, we focus on embedded system design based on available execution 
platforms. This reflects the fact that many modern systems are being built on top 
of some existing platform. Techniques other than the ones described in this book 
must be used when the execution platform needs to be designed as well. Due to 
our focus, the mapping of applications to execution platforms can be seen as 
the main design problem. In the general case, mapping will be performed onto 
multiprocessor systems. 

Even for platform-based design, there may be a number of design options. We 
might be able to select between different variants of a platform, where each variant 
might have a different number of processors, different speeds of processors, or a 
different communication architecture. Moreover, there may be different applicable
scheduling policies. Appropriate options must be selected. 

This leads us to the following definition of our mapping problem [535]: 
Given: 


• a set of applications,
• use cases describing how the applications will be used,
• a set of possible candidate architectures:

  - (possibly heterogeneous) processors,
  - (possibly heterogeneous) communication architectures,
  - possible scheduling policies.


Find:


• a mapping of applications to processors,
• appropriate scheduling techniques (if not fixed),
• a target architecture (if not fixed).


Objectives:


• keeping deadlines and/or maximizing performance,
• minimizing cost, energy consumption, and possibly other objectives.


The exploration of possible architectural options is called design space exploration 
(DSE). As a special case, we may consider a completely fixed platform architecture. 


Designing an AUTOSAR-based automotive system can be seen as an example: 
in AUTOSAR [28], we have a number of homogeneous execution units (called 
ECUs) and a number of software components. The question is: how do we map 
these software components to the ECUs such that all real-time constraints are met? 
We would like to use the minimum number of ECUs. 

For embedded systems, we can assume that the set of applications comprises 
a number of tasks which are released (are ready for execution) repeatedly. The 
executed code can be associated with tasks. For example, there may be the need 
to execute certain code once for every input sample. We denote each task by $\tau_i$ and
sets of tasks by $\tau = \{\tau_1, \ldots, \tau_n\}$.


Definition 6.1 Each execution of a task is called a job (cf. Definition 4.4). For each
task $\tau_i$, there is an associated set of jobs $J(\tau_i)$. Due to the repeated executions, the
set of jobs of task $\tau_i$ is possibly not finite.


Definition 6.2 Tasks $\tau_i$ which are released once every $T_i$ units of time are called
periodic tasks, and $T_i$ is called their period.


Definition 6.3 A task $\tau_i$ is called sporadic if there is a lower bound on the length
of the interval between successive releases of this task. For each sporadic task $\tau_i$,
we call this interval length also $T_i$.

Fig. 6.2 Notation used for jobs


This minimum separation is important: without such a separation, arrival curves for
any interval $\Delta$ could become unbounded. It would be impossible to find a schedule
for a bounded set of resources. 


Definition 6.4 Tasks which are neither periodic nor sporadic are called aperiodic. 


For periodic and sporadic task systems, the concept of hyper-periods simplifies 
scheduling substantially: 


Definition 6.5 Let $\tau$ be a periodic or sporadic task system. Its hyper-period is
defined as the least common multiple of the periods of the individual tasks. 


If tasks can be scheduled for one hyper-period, they can be scheduled for all hyper- 
periods, due to the repeating nature of the task structure. 
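
As a brief illustration of Definition 6.5, the hyper-period can be computed as a least common multiple. The following Python sketch assumes integer periods in a common time unit; the period values and the function name `hyper_period` are hypothetical.

```python
from functools import reduce
from math import gcd

def hyper_period(periods):
    """Least common multiple of all task periods (integer time units)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

periods = [4, 6, 10]                        # hypothetical periods T_i in ms
H = hyper_period(periods)                   # 60
jobs_per_task = [H // T for T in periods]   # [15, 10, 6]
```

Within one hyper-period of 60 ms, the three tasks release 15, 10, and 6 jobs, respectively; a static schedule for these 31 jobs can then be repeated.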


6.1.2 Types of Scheduling Problems 


The following notation is used in the remainder of this chapter for jobs. Let $J = \{J_i\}$
be a set of jobs. Let (see Fig. 6.2):


• $r_i$ be the release time of $J_i$ (the time at which it becomes available for execution),

• $C_i$ be the worst case execution time (WCET) of $J_i$,

• $d_i$ be the (absolute) deadline of $J_i$,

• $D_i$ be the relative deadline, that is, the time between a job $J_i$ becoming available
and the time until which the same job $J_i$ has to finish execution ($D_i = d_i - r_i$),

• $l_i$ be the laxity or slack, defined as

$$l_i = D_i - C_i \tag{6.1}$$

(if $l_i = 0$, then $J_i$ has to be started immediately after it is released),
• $s_i$ be the actual starting time of $J_i$,
• $f_i$ be the actual finishing time of $J_i$.
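
The job parameters listed above can be captured in a small data structure. The following Python sketch is only an illustration; the class name `Job` and the numeric values are assumptions, and the laxity follows Eq. (6.1).

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    r: int   # release time r_i
    C: int   # worst case execution time C_i
    d: int   # absolute deadline d_i

    @property
    def D(self):
        # Relative deadline D_i = d_i - r_i
        return self.d - self.r

    @property
    def laxity(self):
        # Laxity l_i = D_i - C_i (Eq. 6.1)
        return self.D - self.C

j1 = Job("J1", r=2, C=3, d=9)   # hypothetical job with some slack
j2 = Job("J2", r=0, C=4, d=4)   # hypothetical job with zero laxity
```

Here `j1` has a relative deadline of 7 and a laxity of 4, whereas `j2` has a laxity of 0 and therefore has to be started immediately after its release.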


In figures like Fig. 6.2, upward pointing vertical arrows indicate the release of jobs, 
and downward pointing arrows denote the deadline of jobs. 

In the following, we will be using the triplet classification for scheduling 
problems which was presented by Pinedo [455], based on a notation introduced
earlier by Graham et al. [190]. According to the notation, scheduling problems can
be classified by a triplet: 


$$(\alpha \mid \beta \mid \gamma) \tag{6.2}$$


The $\alpha$ Field


The $\alpha$ field describes the machine environment and consists of a single entry. Simple
scheduling algorithms handle the case of single processors, whereas more complex
algorithms also handle systems comprising multiple processors. In this book, we
consider the following possible values of the $\alpha$ field:


• A value of 1 indicates a single processor.

• A value of $Pm$ indicates $m$ processors which can be used in parallel. Each job
can be executed with the same speed on any of the $m$ processors. In this case,
processors are said to be identical (or homogeneous). The $\beta$ field can be used to
express constraints for the allocation of jobs to processors.

• A value of $Qm$ denotes parallel processors with different performances. The
performance is expressed as scaling factors relative to the performance of the
slowest processor. Scaling factors can be represented by a vector $(s_1, \ldots, s_m)$,
where component $s_k$ is the scaling factor of processor $k$. In this case, processors
are called uniform. The uniform processor model is very much simplified; we
will hardly refer to it.

• A value of $Rm$ indicates $m$ processors with unrelated processing speeds. The
execution time of the job or task $i$ on processor $k$ is $C_{i,k}$. Processors are
called heterogeneous. Heterogeneous processors can be optimized for particular
objectives, e.g., for high performance or a small energy consumption. Hence,
heterogeneous processors are very important for embedded systems. Hardware
accelerators can be modeled as special-purpose processors.


The $\alpha$ field will always contain just a single element.


The $\beta$ Field


The $\beta$ field describes processing restrictions. This field may contain several
components. In this book, we will consider the following possible values of this
field:


• An entry $r_i$ denotes existing release times that depend on the job $i$ to be
allocated.

• An entry prmp indicates that preemptions are allowed. Non-preemptive scheduling
is assumed if this entry is missing. Non-preemptive schedulers are based on
the assumption that jobs are executed until they are done. As a result, the response
time for external events¹ may be quite long if some jobs have a large execution
time. Preemptive schedulers must be used if some jobs have long execution times
or if the response time for external events is required to be short. However,
preemption can result in unpredictable execution times of the preempted jobs.
Therefore, restricting preemptions may be required in order to guarantee meeting
the deadline of hard real-time jobs.

• Another possible entry would describe the type of timing constraints. We can
distinguish between soft and hard deadlines (see Definition 1.8 on p. 10).

Scheduling for soft deadlines is frequently based on extensions to standard
operating systems. We will not discuss these systems further in this book.
Therefore, the default assumption in this book is to have hard timing constraints.

• Entries periodic and sporadic may describe the type of task system considered.

• A value of prec expresses the fact that precedence constraints exist. Precedences
among the jobs require jobs to be executed according to certain partial orders.
They may be caused by communication between jobs. For embedded systems,
precedences are the rule rather than an exception.

• For sporadic and periodic task sets, we frequently differentiate scheduling
problems with respect to their deadlines:

The case $D_i = T_i$, for all $i$, is called the case of implicit-deadline tasks,
or Liu-and-Layland (L&L) tasks [348]. This case is indicated by an entry
$D_i = T_i$. Task sets which must satisfy $\forall i: D_i \le T_i$ are called
constrained-deadline tasks.

Tasks whose deadlines do not need to meet any constraints regarding their
period are called arbitrary-deadline tasks. These cases can also be indicated
by corresponding entries.

• We could use this field also to describe the type of scheduling employed. For
example, we could use entries fixed-job-prio and fixed-task-prio for jobs and
tasks with a fixed priority.

Furthermore, we could distinguish between static and dynamic scheduling. 
Dynamic schedulers take decisions at run-time. They are quite flexible but 
generate overhead at run-time. Also, they are usually not aware of global contexts 
such as resource requirements or precedences between jobs. For embedded 
systems, such global contexts are typically available at design time, and they 
should be exploited. 

Static schedulers take their decisions at design time. They are based on 
planning the start times of jobs and generate tables of start times forwarded to 
a simple dispatcher. The dispatcher does not take any decisions, but is just in 
charge of starting jobs at the times indicated in the table. The dispatcher can be 
controlled by a timer, causing the dispatcher to analyze the table. 


¹ This is the time from the occurrence of an external event until the completion of the reaction
required for the event.

Fig. 6.3 TDL in a time-triggered system

Systems which are totally controlled by a timer are said to be entirely
time-triggered (TT systems). Such systems are explained in detail in the book by
Kopetz [303]:

“In an entirely time-triggered system, the temporal control structure of all 
tasks is established a priori by off-line support-tools. This temporal control 
structure is encoded in a Task-Descriptor List (TDL) that contains the cyclic 
schedule for all activities of the node² (Fig. 6.3). This schedule considers the
required precedence and mutual exclusion relationships among the tasks such 
that an explicit coordination of the tasks by the operating system at run time is 
not necessary. ... The dispatcher is activated by the synchronized clock tick. It 
looks at the TDL, and then performs the action that has been planned for this 
instant ....” 

The main advantage of static scheduling is that it can be easily checked if 
timing constraints are met: “For satisfying timing constraints in hard real-time 
systems, predictability of the system behavior is the most important concern; pre- 
run-time scheduling is often the only practical means of providing predictability 
in a complex system” [604]. The main disadvantage is that the response to events 
may be quite poor. 


Multiprocessor scheduling algorithms either can be executed locally on one 
processor or can be distributed among a set of processors. Hence, we can also 
distinguish between centralized and distributed scheduling. This distinction 
could also be expressed in the $\beta$ field.
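
The table-driven dispatching described above for static scheduling can be sketched in a few lines of Python. The sketch uses simulated clock ticks rather than a hardware timer; the contents of the table, the cycle length, and the function name `dispatch` are assumptions chosen for illustration and are not taken from Kopetz's book.

```python
# Hypothetical task-descriptor list: (start time within the cycle, action name).
TDL = [(0, "read_sensor"), (10, "control_law"), (17, "write_actuator")]
CYCLE = 20  # length of the cyclic schedule in clock ticks (assumed)

def dispatch(tick, tdl=TDL, cycle=CYCLE):
    """Return the action planned for this clock tick, or None.

    The dispatcher takes no scheduling decision; it only looks up the table.
    """
    offset = tick % cycle
    for start, action in tdl:
        if start == offset:
            return action
    return None

# Simulate two cycles of clock ticks and record the dispatched actions.
log = [(t, dispatch(t)) for t in range(2 * CYCLE) if dispatch(t) is not None]
```

All decisions are encoded in the table at design time, which is why the timing behavior is easy to check, and also why the reaction to an event arriving between table entries may be slow.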


The $\gamma$ Field


The $\gamma$ field describes the objective function. In this book, we consider the following
possible values of this field:


• An entry of $L_{max}$ means that the maximum lateness is to be minimized.


Definition 6.6 Maximum lateness is defined as the difference between the 
completion time and the deadline, maximized over all jobs. 


Maximum lateness is negative if all tasks complete before their deadline. 


² This term refers to a processor in this case.

• An entry of $MS_{max}$ denotes the case of minimizing the makespan (the time at
which the last job finishes).


Definition 6.7 The makespan is defined as³

$$MS_{max} = \max_i (f_i) \tag{6.3}$$


• In addition to the entries considered by Pinedo, other entries are relevant
for embedded systems. For example, we might want to minimize the energy 
consumption, or we might even consider trade-offs between several objectives. 
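
The two objectives defined above can be computed directly from the finishing times of a given schedule. The following Python sketch is a minimal illustration; the function names and the numeric values are hypothetical.

```python
def max_lateness(finish, deadline):
    """L_max: maximum over all jobs of (f_i - d_i), cf. Definition 6.6."""
    return max(f - d for f, d in zip(finish, deadline))

def makespan(finish):
    """MS_max: maximum over all jobs of f_i (Eq. 6.3)."""
    return max(finish)

finish   = [3, 7, 12]   # hypothetical finishing times f_i
deadline = [5, 8, 12]   # hypothetical absolute deadlines d_i
L_max = max_lateness(finish, deadline)   # 0
MS_max = makespan(finish)                # 12
```

In this example, the third job finishes exactly at its deadline, so the maximum lateness is 0 and no deadline is missed.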


A huge number of scheduling algorithms are available, and comprehensive
coverage of existing algorithms would be infeasible even if an entire book or 
course were available. In a standard undergraduate curriculum, there is typically not 
enough headroom for a dedicated course on scheduling (but this may be different 
for courses for graduate students). Therefore, we provide only a brief introduction to 
scheduling in this book. Many scheduling problems are known to be very complex 
[41, 455]. In many cases, only approximately optimal mappings can be guaranteed. 
We will provide an overview of scheduling algorithms frequently considered in 
embedded systems. Table 6.1 comprises an overview of the techniques in this 
chapter. From left to right, columns refer to the processor model, asynchronous 
arrival times, preemptiveness, precedences, periodic/sporadic tasks vs. aperiodic 
jobs, the deadline model (for periodic/sporadic tasks), job- vs. task-based priorities 
(for periodic/sporadic tasks), global vs. local scheduling (for multiprocessors), 
the objective, the subsection, and the name of algorithm(s). Algorithms like 
earliest deadline first are designed for nonperiodic systems but can be applied in 
periodic/sporadic systems as well. Note that only the last three lines correspond to 
full support for heterogeneous processors, as can be seen in column one. Uniform 
processors will be mentioned only as a possible use of the 0/1 multi-knapsack 
model. If all jobs arrive at the same time (indicated by an entry of “—” for the second 
column), preemption is useless, and hence, the third column is not marked by an 
X. Entries for column $D_i$ are relevant only for periodic/sporadic tasks. Regarding
the objectives, we observe that lateness is the relevant objective in many cases. 
However, for periodic/sporadic scheduling, the key question is: is there a schedule 
which meets the deadlines? Bin packing is designed to minimize the number 
of processors. For the HEFT and CPOP heuristics, the makespan is the relevant 
objective. Only the last line corresponds to a minimization of several objectives, in 
the form either of a single objective at a time or of real multi-objective optimization 
using Pareto optimality. 

Scheduling is similar to performance evaluation in that it cannot be constrained
to a single design step. Rather, scheduling algorithms may be required a number
of times during the design of such systems. Very rough calculations may already be
required while fixing the specification. Later, more detailed predictions of execution
times may be required. After compilation, even more detailed knowledge exists
about the execution times, and accordingly, more precise schedules can be made.
Finally, it may be necessary to decide at run-time which task is to be executed next.
In contrast, in time-triggered systems, RTOS scheduling may be limited to simple
table look-ups for tasks to be executed.

³ Pinedo denotes the makespan as $C_{max}$. We prefer to avoid confusion with execution times $C_i$.

In practice, it is very important to know whether or not a schedule exists for 
a given set of tasks and constraints. A set of tasks is said to be schedulable 
under a given set of constraints if a schedule exists for that set of tasks and 
constraints. For many applications, schedulability tests are important. Tests which 
always return precise results (called exact tests) are NP-hard in many situations 
[178]. Therefore, sufficient and necessary tests are used instead. For sufficient tests, 
sufficient conditions for guaranteeing a schedule are checked. There is a (hopefully 
small) probability of indicating that scheduling cannot be guaranteed even when a 
schedule exists. Necessary tests are based on checking necessary conditions. They 
can be used to show that no schedule exists. However, there may be cases in which 
necessary tests are passed and the schedule still does not exist. 
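
The difference between necessary and sufficient tests can be illustrated with utilization-based checks. The following Python sketch assumes periodic, implicit-deadline tasks on a single processor and, for the sufficient test, rate monotonic scheduling with the well-known Liu-and-Layland utilization bound $n(2^{1/n} - 1)$. The task sets and function names are hypothetical.

```python
def utilization(tasks):
    """tasks: list of (C_i, T_i) pairs; returns the sum of C_i / T_i."""
    return sum(C / T for C, T in tasks)

def necessary_test(tasks):
    # If the utilization exceeds 1, no uniprocessor schedule can exist.
    return utilization(tasks) <= 1.0

def sufficient_test_rm(tasks):
    # Liu-and-Layland bound for rate monotonic scheduling
    # (assumes implicit deadlines, D_i = T_i).
    n = len(tasks)
    return utilization(tasks) <= n * (2 ** (1 / n) - 1)

tasks_a = [(1, 4), (1, 5), (2, 10)]   # utilization 0.65
tasks_b = [(2, 4), (2, 5)]            # utilization 0.9
tasks_c = [(3, 4), (2, 5)]            # utilization 1.15

results = {
    name: (necessary_test(ts), sufficient_test_rm(ts))
    for name, ts in [("a", tasks_a), ("b", tasks_b), ("c", tasks_c)]
}
```

Task set a passes the sufficient test, so a schedule is guaranteed. Task set c fails the necessary test, so no schedule exists. Task set b passes the necessary test but fails the sufficient one, so these two tests alone leave the question open and a more precise analysis would be needed.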


6.2 Scheduling for Uniprocessors 


Let us first consider the case of uniprocessor systems. According to the triplet 
notation, this corresponds to the case (1|..|..). We are using some of the material 
from the book by Buttazzo [81] for this section. Refer to this book for additional 
references. 


6.2.1 Scheduling for Independent Jobs 


Furthermore, we are restricting our discussion initially to the even more special case 
of independent jobs executed on uniprocessors. 


Earliest Due Date (EDD) Algorithm 


First of all, we are looking at the situation where all jobs arrive at the same time, 
and we try to minimize lateness. If all jobs arrive at the same time, preemption is 
obviously useless. Therefore, according to the triplet notation, we are considering 
the case $(1 \mid\ \mid L_{max})$. A very simple rule for this case was found by Jackson in 1955
[263]. 


Theorem 6.1 (Jackson’s Rule) Given a set of n independent jobs with deadlines, 
any algorithm that executes the jobs in order of nondecreasing deadlines is optimal 
with respect to minimizing the maximum lateness.


Fig. 6.4 Schedules S and S' (in S, J_b is executed before J_a; in S', J_a is executed before J_b; f_a = f'_b)


The algorithm following this rule is called the earliest due date (EDD) algorithm. 
If the deadlines are known in advance, EDD can be implemented as a static 
scheduling algorithm. EDD requires all jobs to be sorted by their deadlines. Hence, 
its complexity is O (n log(n)). 
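
The EDD rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the `Job` record, the function name `edd_schedule`, and the five sample jobs are assumptions chosen for the example. All jobs are assumed to arrive at time 0 and to run without preemption.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    exec_time: int   # execution time C of the job
    deadline: int    # absolute deadline d of the job


def edd_schedule(jobs):
    """Return (execution order, maximum lateness) for jobs arriving at time 0."""
    order = sorted(jobs, key=lambda j: j.deadline)   # nondecreasing deadlines
    t = 0
    max_lateness = float("-inf")
    for job in order:
        t += job.exec_time                           # finishing time f of this job
        max_lateness = max(max_lateness, t - job.deadline)
    return [j.name for j in order], max_lateness


# Hypothetical job set, used only for illustration.
jobs = [Job("J1", 1, 3), Job("J2", 1, 10), Job("J3", 1, 7),
        Job("J4", 3, 8), Job("J5", 2, 5)]
order, l_max = edd_schedule(jobs)
print(order, l_max)
```

For this assumed job set, the order is J1, J5, J3, J4, J2 and the maximum lateness is −1, so all deadlines are met. The sort dominates the run-time, in line with the O(n log(n)) complexity stated above.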


Proof of the Optimality of EDD Let S be a schedule generated by any algorithm A.
Suppose A does not lead to the same result as EDD. Then, there are jobs J_a and
J_b such that the execution of J_b precedes the execution of J_a in S, even though the
deadline of J_a is earlier than that of J_b (d_a < d_b). Now, consider a schedule S'. S'
is generated from S by swapping the execution orders of J_a and J_b (see Fig. 6.4).

In schedule S, the deadline of J_a is earlier than that of J_b, but J_b is executed first.
J_a therefore finishes later than J_b and has the earlier deadline.
Hence, the maximum lateness among jobs J_a and J_b is that of J_a, or L_max(a, b) =
f_a − d_a.

For schedule S', L'_max(a, b) = max(L'_a, L'_b) is the maximum lateness among
jobs J_a and J_b. L'_a is the lateness of job J_a in schedule S'. L'_b is defined
accordingly. There are two possible cases:


1. L'_a ≥ L'_b: In this case, we have
L'_max(a, b) = f'_a − d_a
J_a terminates earlier in the new schedule. Therefore, we have
L'_max(a, b) = f'_a − d_a < f_a − d_a.
The right side of this inequality is the maximum lateness in schedule S. Hence,
the following holds:
L'_max(a, b) < L_max(a, b)
2. L'_a ≤ L'_b:
In this case, we have
L'_max(a, b) = f'_b − d_b = f_a − d_b (see Fig. 6.4).
The deadline of J_a is earlier than the one of J_b.
This leads to
L'_max(a, b) < f_a − d_a
Again, we have
L'_max(a, b) < L_max(a, b)


As a result, any schedule (which is not an EDD schedule) can be turned into an EDD
schedule by a finite number of swaps. Maximum lateness cannot increase during
these swaps. Therefore, EDD is optimal for this class of scheduling problems. □


Earliest Deadline First (EDF) Algorithm


Let us consider the case of different release times for uniprocessor systems 
next. Under this scenario, preemption can potentially reduce maximum lateness. 
According to the triplet notation, this corresponds to the case (1 | r_i, prmp | L_max).

The earliest deadline first (EDF) algorithm is optimal with respect to minimizing 
the maximum lateness. It is based on the following theorem [222]: 


Theorem 6.2 Given a set of n independent jobs with arbitrary arrival times, any 
algorithm that at any instant executes the job with the earliest absolute deadline 
among all the ready jobs is optimal with respect to minimizing the maximum 
lateness. 


EDF requires that each time a new ready job arrives, it is inserted into a queue
of ready jobs, sorted by their deadlines. Hence, EDF is a dynamic scheduling
algorithm. If a newly arrived job is inserted at the head of the queue, the currently
executing job is preempted. If sorted lists are used for the queue, the complexity of
EDF is O(n²). Bucket arrays could be used for reducing the execution time, but this
option is typically not considered.


Example 6.1 Figure 6.5 shows a schedule derived with the EDF algorithm. At time
4, job J_2 arrives with an earlier deadline. Therefore, it preempts J_1. At time 5, job J_3 arrives.
Due to its later deadline, it does not preempt J_2. The deadline of J_1 is later than that
of J_3, and hence, it resumes only after J_3 has terminated. Priorities are obviously
dynamic: they depend on which deadline is next. Since EDF uses dynamic priorities,
it cannot be used with an operating system providing only fixed priorities. However,
it has been shown that operating systems can be extended to simulate an EDF policy
at the application level [132]. ∇
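
A minimal sketch of EDF in discrete time (one scheduling decision per time unit) is given below. It uses a binary heap from Python's `heapq` module as the ready queue instead of a sorted list; the function name, the tuple layout, and the job parameters are assumptions made for illustration. The three jobs are chosen so that the behavior matches the narrative of Example 6.1; they are not claimed to be the values used in Fig. 6.5.

```python
import heapq


def edf_schedule(jobs):
    """Simulate preemptive EDF on one processor in unit time steps.

    jobs: list of (name, release, exec_time, deadline) with integer values
    and exec_time >= 1. Returns (timeline, max_lateness), where timeline[t]
    is the job executed in [t, t+1) or None if the processor is idle.
    """
    pending = sorted(jobs, key=lambda j: j[1])   # not yet released, by release time
    ready = []                                   # heap ordered by absolute deadline
    remaining = {}
    timeline, max_lateness = [], float("-inf")
    t = i = 0
    while i < len(pending) or ready:
        while i < len(pending) and pending[i][1] <= t:
            name, _, c, d = pending[i]
            remaining[name] = c
            heapq.heappush(ready, (d, name))     # newly arrived job joins the queue
            i += 1
        if ready:
            d, name = ready[0]                   # earliest deadline among ready jobs
            timeline.append(name)
            remaining[name] -= 1
            if remaining[name] == 0:
                heapq.heappop(ready)
                max_lateness = max(max_lateness, t + 1 - d)
        else:
            timeline.append(None)
        t += 1
    return timeline, max_lateness


# Assumed job parameters: (name, release time, execution time, deadline).
jobs = [("J1", 0, 10, 33), ("J2", 4, 3, 28), ("J3", 5, 10, 29)]
timeline, l_max = edf_schedule(jobs)
print(timeline, l_max)
```

With these assumed values, J1 runs during [0, 4), J2 during [4, 7), J3 during [7, 17), and J1 resumes during [17, 23). A preemption occurs only when a newly arrived job reaches the head of the queue.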


Proof of Theorem 6.2 Let S be a schedule generated by some algorithm A, where A
is different from EDF. Let S_EDF be a schedule generated by EDF. Now, we partition
time into disjoint intervals of length 1.⁴ Each interval comprises times within the
range [t, t+1). Let S(t) be the job which, according to schedule S, is executed
during the interval [t, t+1). Let E(t) be the job which at time t has the earliest
deadline among all ready jobs. Let t_E(t) be the time (> t) at which job E(t) is starting its
execution in schedule S. S is not an EDF schedule. Therefore, there must be a time
t at which we are not executing the job having the earliest deadline. For t, we have
S(t) ≠ E(t) (see Fig. 6.6).

Fig. 6.5 EDF schedule

⁴ This proof assumes a discrete time domain. It can be extended to a continuous time domain.

Fig. 6.7 Schedule after swapping jobs S(t) and E(t)

Using the same arguments as for Jackson's rule, we can show that swapping
S(t) and E(t) as in Fig. 6.7 does not increase maximum lateness. Therefore, by
a finite number of swaps, any non-EDF schedule can be turned into an EDF schedule
without increasing maximum lateness. This proves that EDF is optimal among all
possible scheduling algorithms.

We can show that swapping will keep all deadlines, provided they were kept
in schedule S. If all deadlines are kept in S, the maximum lateness in
the schedule S is at most 0. Since EDF returns the optimal schedule for minimizing the
maximum lateness, the maximum lateness of the EDF schedule is then also at most 0. Hence, for
this problem class, EDF finds a schedule meeting all deadlines whenever one exists. □


Least Laxity (LL) Algorithm 


Focusing on laxity, we are now considering the case (1 | r_i, prmp, .. | ..), with the goal
of finding a schedule if one exists. Least laxity (LL), least slack time first (LST), and
minimum laxity first (MLF) are three names for a laxity-based scheduling strategy
[347]. According to LL scheduling, job priorities are a monotonically decreasing
function of the laxity (see Eq. (6.1); the less laxity, the higher the priority). Laxity is
dynamically changing and needs to be dynamically recomputed.

Fig. 6.8 Least laxity schedule


Example 6.2 Figure 6.8 shows an LL schedule. Computation of the laxity is
included. At time 4, job J_1 is preempted, as before. At time 5, J_2 is now also
preempted, due to the lower laxity of job J_3. ∇


LL scheduling is also preemptive. Preemptions are not restricted to times at 
which new jobs become available. Negative laxities provide an early warning for 
deadlines to be missed. It can be shown (this is left as an exercise in [347]) that 
LL is also an optimal scheduling policy for uniprocessor systems with meeting 
deadlines as the objective. This means that it will find a schedule if one exists. Due 
to its dynamic priorities, it cannot be used with a standard OS providing only fixed 
priorities. Furthermore, LL scheduling—in contrast to EDF scheduling—requires 
the knowledge of the execution time and typically generates many context switches. 
Its use is therefore restricted to special situations where its properties are attractive. 
Also, laxity can play a role in multiprocessor scheduling, as will be shown in 
Sects. 6.3.3 and 6.3.4. 
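
The following sketch illustrates LL scheduling in unit time steps. It assumes that the laxity of a ready job at time t is its absolute deadline minus t minus its remaining execution time, and it breaks ties by job name; both the tie-breaking rule and the job parameters are assumptions made for this illustration. The parameters are chosen so that the preemptions at times 4 and 5 described in Example 6.2 are reproduced.

```python
def ll_schedule(jobs):
    """Simulate least laxity scheduling on one processor in unit time steps.

    jobs: list of (name, release, exec_time, deadline) with integer values.
    Returns (timeline, max_lateness); timeline[t] is the job run in [t, t+1).
    """
    remaining = {name: c for name, _, c, _ in jobs}
    info = {name: (r, d) for name, r, _, d in jobs}
    timeline, max_lateness, t = [], float("-inf"), 0
    while any(remaining.values()):
        ready = [n for n, rem in remaining.items()
                 if rem > 0 and info[n][0] <= t]
        if ready:
            # Laxity is recomputed for every ready job at every time step.
            name = min(ready, key=lambda n: (info[n][1] - t - remaining[n], n))
            timeline.append(name)
            remaining[name] -= 1
            if remaining[name] == 0:
                max_lateness = max(max_lateness, t + 1 - info[name][1])
        else:
            timeline.append(None)
        t += 1
    return timeline, max_lateness


# Assumed job parameters: (name, release time, execution time, deadline).
jobs = [("J1", 0, 10, 33), ("J2", 4, 3, 28), ("J3", 5, 10, 29)]
ll_timeline, ll_max = ll_schedule(jobs)
switches = sum(1 for a, b in zip(ll_timeline, ll_timeline[1:]) if a != b)
print(ll_timeline, ll_max, switches)
```

At time 5, J2 has laxity 28 − 5 − 2 = 21 and J3 has laxity 29 − 5 − 10 = 14, so J3 preempts J2. Once the laxities of several jobs become equal, the jobs start to alternate, which illustrates why LL typically generates many context switches.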


Scheduling Without Preemption 


Let us now consider the case of not allowing preemptions, denoted as (1 | r_i | L_max).


Theorem 6.3 If preemption is not allowed, optimal schedules must leave the 
processor idle at certain times in order to finish jobs with early deadlines arriving 
late. 


Proof Let us assume that an optimal non-preemptive scheduler (not having knowledge
about the future) never leaves the processor idle. This scheduler must schedule
the example of Fig. 6.9 optimally (it must find a schedule if one exists). For the
example of Fig. 6.9, we assume we are given two tasks. Let τ_1 be a periodic task
with C_1 = 2, T_1 = 4, D_1 = 4, and r_1 = 0. Let τ_2 be a sporadic task with C_2 = 1,
D_2 = 1, T_2 = 4, and r_2 = 1, i.e., sporadically becoming available at times 4k + 1 (k = 0, 1, 2, ...).

Fig. 6.9 Scheduler needs to leave processor idle

Under the above assumptions, our scheduler has to start the execution of task τ_1
at time 0, since it is supposed not to leave any idle time. Since the scheduler is
non-preemptive, it cannot start τ_2 when it becomes available at time 1. Hence, τ_2 misses
its deadline. If the scheduler had left the processor idle (as shown in Fig. 6.9 at time
4), a legal schedule would have been found. Hence, the scheduler is not optimal.
This contradicts the assumption that an optimal scheduler which never leaves the
processor idle exists. □


We conclude that, in order to avoid missed deadlines, the scheduler needs knowledge
about the future. Algorithms having such knowledge are called clairvoyant. An algorithm leaving the
processor idle in the presence of executable tasks is not work-conserving:


Definition 6.8 A scheduling algorithm is work-conserving if it does not allow 
there to be a time at which a processor is idle and there is an executable task [119]. 


If no knowledge about the arrival times is available a priori, then no online algorithm 
can decide whether or not to keep the processor idle. 

If arrival times are known a priori, the scheduling problem becomes NP-hard 
in general, and branch and bound techniques are typically used for generating 
schedules. 


6.2.2 Scheduling with Precedence Constraints 


Next, let us consider precedence constraints, according to the triplet notation
denoted as (1 | r_i, prmp, prec | L_max).


Task Graphs 


Precedence constraints are expressed by directed acyclic graphs (DAGs, cf. Definition 2.6)
G = (τ, E). The set τ represents the vertices (or nodes) of the DAG and
E ⊆ τ × τ its edges.

Fig. 6.10 Task DAG

Fig. 6.11 Precedence graph and schedule


Example 6.3 In Fig. 6.10, edges express that source nodes (the first components of
the tuples representing edges) must be executed before their sink nodes (the second
components of the tuples representing edges). Vertex labels denote task numbers. ∇


There may be several reasons for describing applications as DAGs: 


1. On the one hand, each vertex might correspond to an instance of a task, and edges 
would then represent dependencies between tasks. 

2. On the other hand, the availability of multiprocessors leads to the idea of splitting 
tasks into subtasks and executing these subtasks in an overlapping manner on 
different processors. Each vertex could then correspond to a subtask. Automatic 
partitioning of tasks into subtasks such that parallel processors can be efficiently 
exploited is called automatic parallelization. Automatic parallelization is even 
more difficult than automatic scheduling for a given number of subtasks. 


Both cases of creating DAGs can be used in combination: we can have dependencies 
among tasks, and tasks can be split into subtasks. In the following, we assume that 
the DAG represents any of the situations just described, and we will call the DAGs 
task graphs. For scheduling, it is not relevant how the DAG was actually generated. 


Example 6.4 A legal schedule for a simpler task graph including message transmission
is shown in Fig. 6.11. Task τ_3 can be executed only after tasks τ_1 and τ_2 have
completed and sent messages to τ_3. ∇


Latest Deadline First (LDF) Algorithm


An optimal algorithm for minimizing the maximum lateness for the case of 
simultaneous arrival times of dependent tasks or jobs was presented by Lawler 
[326]. The algorithm is called latest deadline first (LDF). LDF reads the task graph. 
Among all tasks with no successors, it picks the one with the latest deadline and puts 
it into a queue. It then repeats this process, always selecting the task with the latest 
deadline among tasks whose successors have all been selected and inserting it into 
the queue. At run-time, the tasks are executed in an order opposite to the order in 
which tasks have been entered into the queue. LDF is non-preemptive and is optimal 
for uniprocessors. 


Example 6.5 Consider the case of Fig. 6.11. LDF would first store τ_3 in a queue,
since it has no successor. As a result, successors of τ_1 and τ_2 have all been selected
already. Which of the two is stored in the queue first depends on their deadlines. The
node having the later deadline is stored in the queue first. At run-time, the queue is
processed in reverse order, starting, for example, with τ_1. ∇
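
The LDF procedure can be sketched as follows. The function name `ldf_order`, the dictionary-based graph representation, and the deadlines are assumptions made for illustration; the graph structure mirrors the description of Fig. 6.11 (τ_1 and τ_2 both precede τ_3), written here with the identifiers `t1`, `t2`, and `t3`.

```python
def ldf_order(deadlines, successors):
    """Return the execution order produced by latest deadline first (LDF).

    deadlines:  {task: deadline}
    successors: {task: set of direct successor tasks} (missing key = no successors)
    """
    queue = []        # filled starting from the end of the schedule
    chosen = set()
    while len(queue) < len(deadlines):
        # Candidates: unselected tasks whose successors have all been selected.
        candidates = [t for t in deadlines
                      if t not in chosen and successors.get(t, set()) <= chosen]
        if not candidates:
            raise ValueError("task graph contains a cycle")
        pick = max(candidates, key=lambda t: deadlines[t])   # latest deadline
        queue.append(pick)
        chosen.add(pick)
    return queue[::-1]   # tasks run in the reverse order of insertion


# Hypothetical deadlines for the three tasks of the precedence graph.
deadlines = {"t1": 4, "t2": 6, "t3": 8}
successors = {"t1": {"t3"}, "t2": {"t3"}}
print(ldf_order(deadlines, successors))
```

With the assumed deadlines, `t3` is selected first, then `t2` (later deadline than `t1`), then `t1`; the run-time order is therefore `t1`, `t2`, `t3`.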


The case of asynchronous arrival times can be handled with a modified EDF 
algorithm. The key idea is to transform the problem from a given set of dependent 
jobs into a set of independent jobs with different timing parameters [98]. This 
algorithm is again optimal for uniprocessor systems. 

If preemption is not allowed, the heuristic algorithm developed by Stankovic 
and Ramamritham [508] can be used. 


6.2.3 Periodic Scheduling Without Precedence Constraints 


Next, we will consider the periodic case. We will consider mostly tasks instead 
of jobs, since most properties for periodic systems can be derived for tasks. We 
will restrict ourselves to a description of the case in which tasks are independent, 
described as (1 | r_i, prmp, periodic | ..) in the triplet notation.


Notation 


For periodic scheduling, objectives relevant for aperiodic scheduling are less useful. 
For example, minimization of the total length of the schedule is not an issue if we 
are talking about an infinite repetition of jobs. The best that we can do is to design 
an algorithm which will always find a schedule if one exists. This motivates the 
definition of optimality for periodic schedules. 


Definition 6.9 For periodic scheduling, a scheduler is defined to be optimal iff it 
will find a feasible schedule if one exists.


Definition 6.10 For periodic and sporadic task systems τ = {τ_1, ..., τ_n}, we define
task utilization as

$$u_i = \frac{C_i}{T_i} \tag{6.4}$$

This means that for sporadic task systems, we are using the same definition as for
periodic systems, even though T_i just denotes the minimum separation of jobs.


Definition 6.11 For a task system τ = {τ_1, ..., τ_n} with utilization u_i of task τ_i, we
define the maximum and the total utilization by

$$U_{max} = \max_i (u_i) \tag{6.5}$$

$$U_{sum} = \sum_i u_i \tag{6.6}$$


Rate Monotonic Scheduling 


Rate monotonic (RM) scheduling [348] is probably the most well-known scheduling 
algorithm for independent periodic tasks. Rate monotonic scheduling is based on the 
following assumptions (“RM assumptions”): 


1. All tasks that have hard deadlines are periodic.

2. All tasks are independent.

3. D_i = T_i, for all tasks.

4. C_i is constant and is known for all tasks. Self-suspension (voluntarily relinquishing
the execution) is not allowed.

5. The time required for context switching is negligible.

6. For a single processor and for n tasks, the accumulated utilization U_sum does not
exceed the following bound:

$$U_{sum} = \sum_{i=1}^{n} \frac{C_i}{T_i} \le n \left(2^{1/n} - 1\right) \tag{6.7}$$

Figure 6.12 shows the bound of constraint (6.7).
The bound is about 0.7 for large n:

$$\lim_{n \to \infty} n \left(2^{1/n} - 1\right) = \log_e(2) = \ln(2) \approx 0.7 \tag{6.8}$$


Then, according to the policy for rate monotonic scheduling, the priority of
tasks is a monotonically decreasing function of their period. In other words,
tasks with a short period will get a high priority, and tasks with a long period will
be assigned a low priority. RM scheduling is a preemptive scheduling policy with
fixed priorities.

Fig. 6.13 Example of a schedule generated with RM scheduling


Example 6.6 Figure 6.13 shows a schedule generated with RM scheduling. Task τ_2
is preempted several times. Double-headed arrows indicate the arrival time of a job
as well as the deadline of the previous job. Tasks τ_1 to τ_3 have periods of 2, 6, and
6, respectively. Execution times are 0.5, 2, and 1.75. Task τ_1 has the shortest period
and, hence, the highest rate and priority. Each time task τ_1 becomes available, its
jobs preempt the currently active task. Task τ_2 has the same period as task τ_3, and
neither of them preempts the other. ∇


Constraint (6.7) requires that some of the computing power of the processor is 
not used in order to make sure that all requests are honored in time. What is the 
reason for this bound on the utilization? The key reason is that RM scheduling, due 
to its static priorities, will possibly preempt a task which is close to its deadline in 
favor of some higher-priority task with a much later deadline. The task having a 
lower priority can then miss its deadline. 


Example 6.7 In Fig. 6.14, task parameters are T_1 = 5, C_1 = 3, T_2 = 8, and C_2 =
3. In this case, we have U_sum = 3/5 + 3/8 = 39/40 = 0.975. This value exceeds the bound
2 · (2^(1/2) − 1) ≈ 0.828. Not enough idle time is available, so schedulability
is not guaranteed for RM scheduling, and
in fact, the deadline is missed at time 8. We assume that the missing computations
are not scheduled in the next period. ∇
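
The utilization test of constraint (6.7) is easy to evaluate mechanically. The sketch below is an illustration only; the function names `rm_bound` and `rm_sufficient_test` and the representation of a task as a pair (C, T) are assumptions made here.

```python
def rm_bound(n):
    """Right-hand side of constraint (6.7) for n tasks."""
    return n * (2 ** (1 / n) - 1)


def rm_sufficient_test(tasks):
    """tasks: list of (C, T) pairs. Returns (U_sum, True if (6.7) holds).

    A result of False does not prove that RM fails; the test is only sufficient.
    """
    u_sum = sum(c / t for c, t in tasks)
    return u_sum, u_sum <= rm_bound(len(tasks))


# Task parameters of Example 6.7: C1 = 3, T1 = 5, C2 = 3, T2 = 8.
u_sum, guaranteed = rm_sufficient_test([(3, 5), (3, 8)])
print(round(u_sum, 3), round(rm_bound(2), 3), guaranteed)
```

For Example 6.7, the sketch reports a utilization of 0.975 against a bound of about 0.828, so schedulability is not guaranteed. Applying the same test to the parameters of Example 6.6 gives a utilization of 0.875 against a bound of about 0.780 for three tasks; the test also fails there, which only means that (6.7) cannot be used to guarantee schedulability for that task set.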


Such missed deadlines cannot happen if the utilization of the processor is very
low, and obviously, they can happen when the utilization is high, as in Fig. 6.14.
If the constraint (6.7) is met, the utilization is guaranteed to be low enough to prevent
problems like that of Fig. 6.14. Constraint (6.7) is a sufficient condition. This means
we might still find a schedule if the condition is not met. Other sufficient conditions
exist [54].

Fig. 6.14 RM schedule does not meet deadline at time 8

RM scheduling has the following important advantages: 


• We can show that it is an optimal fixed priority preemptive scheduling algorithm
for uniprocessor systems [54].

• It is based on static priorities, enabling its application in an operating system
with fixed priorities.

• If the above six RM assumptions are met, all deadlines will be met
(see Buttazzo [81]).


RM scheduling is also the basis for a number of formal proofs of schedulability. 
Designing examples and proofs is facilitated if the most problematic situations for 
scheduling are known. To get started, we assume the following property: 


Property 6.1 We assume that every job completes before the next job of the same 
task is released. 


Definition 6.12 A critical instant for a task τ_i is defined to be an instant t at which
a release of that task will have the largest response time.


Theorem 6.4 (Critical Instant Theorem) For fixed priority scheduling, the
response time for execution on a uniprocessor system is maximized for each task τ_i
if τ_i is released at the same time as all tasks having a higher priority.


Proof Here we present the original proof by Liu and Layland [348], using the
wording of these authors (except for making the notation consistent with ours): “Let
τ = {τ_1, ..., τ_n} denote a set of priority-ordered tasks with τ_n being the task with
the lowest priority. Consider a particular request for τ_n that occurs at t_1. Suppose
that between t_1 and t_1 + T_n, the time at which the subsequent request of τ_n occurs,
requests for task τ_i, i < n, occur at t_2, t_2 + T_i, t_2 + 2T_i, ..., t_2 + kT_i, as illustrated
in Fig. 6.15. Clearly, the preemption of τ_n by τ_i will cause a certain amount of delay
in the completion of the request for τ_n that occurred at t_1, unless the request for τ_n is
completed before t_2. Moreover, from Fig. 6.15 we see immediately that advancing
the request time t_2 will not speed up the completion of τ_n. The completion time is
either unchanged or delayed by such an advancement. Consequently, the delay in
the completion of τ_n is largest when t_2 coincides with t_1. Repeating the argument
for all τ_i, i = 2, ..., n − 1, we prove the theorem.” □

Fig. 6.15 Delaying task τ_n by some τ_i of higher priority


Implicitly, we have used Property 6.1 in the proof. If we consider the general case 
(i.e., the situation in which the assumption of Property 6.1 does not hold; see, for 
example, Baker [35]), Theorem 6.4 remains valid, but the proof becomes more
complex, as shown by Devillers et al. [129] and Bril [69].⁵

The critical instant theorem is of great help when scheduling uniprocessor 
systems. In general, the critical instant theorem does not hold for multiprocessor 
systems, which makes proofs much harder. So, the validity of this theorem should 
really be appreciated! 

Let us look at other properties of RM scheduling now. The idle time or spare 
capacity of the processor is not always required. 


Theorem 6.5 Let τ be a system of periodic tasks. If the period of all tasks is a
multiple of the period of the task having the next higher priority, τ can be scheduled
with RM scheduling if

$$U_{sum} \le 1 \tag{6.9}$$


Example 6.8 This requirement is met if tasks in a TV set must be executed at rates
of 25, 50, and 100 Hz (or 30, 60, and 120 Hz). ∇
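
The condition of Theorem 6.5 can be checked mechanically, as in the following sketch. The function name, the (C, T) pair representation, and the execution times are assumptions for illustration; the periods of 10, 20, and 40 ms correspond to the rates of 100, 50, and 25 Hz in Example 6.8. Exact fractions are used so that the comparison with 1 is not affected by rounding.

```python
from fractions import Fraction


def rm_harmonic_test(tasks):
    """tasks: list of (C, T) pairs with integer values.

    Returns True if every period is a multiple of the next shorter period
    and the total utilization does not exceed 1 (condition of Theorem 6.5).
    False only means that the theorem is not applicable.
    """
    periods = sorted(t for _, t in tasks)
    harmonic = all(b % a == 0 for a, b in zip(periods, periods[1:]))
    u_sum = sum(Fraction(c, t) for c, t in tasks)
    return harmonic and u_sum <= 1


# Periods in ms for rates of 100, 50, and 25 Hz; execution times are hypothetical.
tv_tasks = [(2, 10), (6, 20), (12, 40)]
print(rm_harmonic_test(tv_tasks))
```

The assumed task set has a utilization of 0.8, which exceeds the bound of constraint (6.7) for three tasks (about 0.78), yet the harmonic periods allow schedulability to be concluded from Theorem 6.5.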


Proof of Theorem 6.5 Let tasks be sorted by priorities, such that ∀i: T_i ≤ T_{i+1}.
Consider some task τ_i and the task with the next lower priority, task τ_{i+1} (see
Fig. 6.16). Note that the second deadline of τ_{i+1} matches the fourth deadline of
τ_i neatly. Therefore, we can fold the execution times of task τ_{i+1} into the execution
times of τ_i and create a new task containing the execution times of the two
original tasks. This folding is feasible if the total execution time of the two tasks
does not exceed the period of τ_{i+1}. The process can be repeated in the same way
with the next lower-priority task. Overall, folding is feasible as long as the overall
utilization does not exceed 1. □


The bounds in Constraints (6.7) and (6.9) allow us to check for schedulability. 


⁵ I owe this hint to J.J. Chen of TU Dortmund.

Fig. 6.16 Folding of tasks of adjacent priorities

Fig. 6.17 EDF generated schedule for the example of Fig. 6.14 (time axis t from 0 to 24)


Due to the critical instant theorem, the proof of optimality of RM scheduling 
needs to consider only the case in which tasks are released concurrently with all 
other tasks of higher priority. 


Earliest Deadline First Scheduling 


EDF can also be applied to periodic task sets. Obviously, it is sufficient to solve the 
scheduling problem for a single hyper-period. This schedule can then be repeated 
for the other hyper-periods. The hyper-period for the example of Fig. 6.14 is 40. 
It follows from the optimality of EDF for nonperiodic schedules that EDF is 
also optimal for a single hyper-period and therefore also for the entire scheduling 
problem. No additional constraints must be met to guarantee optimality. This 
implies that EDF is optimal also for the case of U_sum = 1.


Example 6.9 No deadline is missed if the example of Fig. 6.14 is scheduled with
EDF (see Fig. 6.17). At time 5, the behavior is different from that of RM scheduling:
due to the earlier deadline of τ_2, it is not preempted. ∇
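
Examples 6.7 and 6.9 can be reproduced with a small unit-step simulation over one hyper-period. The sketch below is an illustration under stated assumptions: integer task parameters, D_i = T_i, synchronous release at time 0, and unfinished work of a late job being dropped (as assumed in Example 6.7). The function name `simulate` and the policy strings are hypothetical, and `math.lcm` requires Python 3.9 or later.

```python
from math import lcm


def simulate(tasks, policy):
    """tasks: list of (C, T) pairs with integer values and D = T.

    policy: "RM" (fixed priorities, shorter period first) or "EDF".
    Returns a list of (task index, time) pairs for missed deadlines
    within one hyper-period.
    """
    hyper = lcm(*(t for _, t in tasks))
    remaining = [0] * len(tasks)   # remaining execution time of the current job
    deadline = [0] * len(tasks)    # absolute deadline of the current job
    misses = []
    for now in range(hyper + 1):
        for i, (c, t) in enumerate(tasks):
            if now % t == 0:                    # deadline of old job, release of new job
                if remaining[i] > 0:
                    misses.append((i, now))     # old job did not finish in time
                remaining[i], deadline[i] = c, now + t
        if now == hyper:
            break
        ready = [i for i in range(len(tasks)) if remaining[i] > 0]
        if ready:
            if policy == "RM":
                run = min(ready, key=lambda i: tasks[i][1])   # shortest period
            else:
                run = min(ready, key=lambda i: deadline[i])   # earliest deadline
            remaining[run] -= 1
    return misses


tasks = [(3, 5), (3, 8)]          # parameters of Example 6.7; hyper-period 40
print(simulate(tasks, "RM"))      # the second task (index 1) misses at time 8
print(simulate(tasks, "EDF"))     # no deadline is missed
```

Under RM, the second task receives only two of its three time units before time 8. Under EDF, it keeps the processor at time 5 because its deadline (8) is earlier than that of the newly released job of the first task (10).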


Explicit-Deadline Tasks 


Now we move toward the consideration of tasks whose deadline is not necessarily the same
as the period. Such tasks are called explicit-deadline tasks. Each task τ_i in such
a system is characterized by a triple (C_i, D_i, T_i), where D_i is the relative deadline.
The case D_i ≤ T_i is called the constrained-deadline case. The arbitrary-deadline
case is characterized by the absence of such a constraint. Obviously, the class of
explicit-deadline tasks is more general than the class of implicit-deadline tasks, and
each implicit-deadline task is also an explicit-deadline task.

Utilization is of limited value for the characterization of computational demands
of explicit-deadline tasks. To some extent, density plays the role which utilization
played so far. Density is defined as

$$\mathit{dens}_i = \frac{C_i}{\min(D_i, T_i)} \tag{6.10}$$

$$\mathit{dens}_{sum}(\tau) = \sum_{\tau_i \in \tau} \mathit{dens}_i \tag{6.11}$$

$$\mathit{dens}_{max}(\tau) = \max_{\tau_i \in \tau} (\mathit{dens}_i) \tag{6.12}$$


Density values characterize computational requirements. A tighter bound is provided
by the so-called demand bound function (DBF):


Definition 6.13 For any sporadic task τ_i and any real number t > 0, the demand
bound function DBF(τ_i, t) is the largest cumulative execution requirement of all
jobs that can be generated by τ_i to have both their release times and their deadlines
within a contiguous interval of length t.


The overall execution requirements of task τ_i over an interval [t_0, t_0 + t) are
maximized if one of its jobs arrives at the start of the interval, i.e., at time
instant t_0, and its subsequent jobs arrive as rapidly as permitted, i.e., at instants
t_0 + T_i, t_0 + 2T_i, t_0 + 3T_i, .... This observation leads to Eq. (6.13) [39, 41]:

$$DBF(\tau_i, t) = \max\left(0, \left(\left\lfloor \frac{t - D_i}{T_i} \right\rfloor + 1\right) \cdot C_i\right) \tag{6.13}$$


Density and the demand bound function are related: 


Lemma 6.1 For all tasks τ_i and for all t > 0:

$$t \times \mathit{dens}_i \ge DBF(\tau_i, t) \tag{6.14}$$

Proof Let us compare the graphs depicting density and DBF as a function of time.
Figure 6.18 shows both functions. The left hand side of Eq. (6.14) is visualized as
the straight line with slope dens_i. DBF is a step function with steps of height C_i.
Whenever a task must be executed, the step function increases by C_i. The first step
is at t = D_i. By definition of the density, this step does not exceed the straight line.
The next steps will be at t = D_i + T_i, t = D_i + 2T_i, t = D_i + 3T_i, and so on, since
these are the intervals of time after which the demand increases by C_i. Over each
further period T_i, the straight line rises by T_i × dens_i ≥ C_i. Hence, these
steps will not exceed the straight line either. □
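
Equations (6.10) and (6.13) translate directly into code, which also allows Lemma 6.1 to be checked numerically for sample tasks. The function names and the task parameters below are assumptions made for illustration; floor division is used for the floor operation in Eq. (6.13).

```python
def density(c, d, period):
    """Density of a task with execution time c, relative deadline d, period."""
    return c / min(d, period)


def dbf(c, d, period, length):
    """Demand bound function of Eq. (6.13) for an interval of the given length."""
    return max(0, ((length - d) // period + 1) * c)


# Hypothetical constrained-deadline task: C = 2, D = 4, T = 6.
c, d, period = 2, 4, 6
for length in (3, 4, 9, 10, 16):
    print(length, dbf(c, d, period, length), length * density(c, d, period))
```

For the assumed task, the demand is 0 for intervals shorter than D = 4 and then grows by C = 2 at lengths 4, 10, 16, and so on, while the line t × dens_i reaches 2, 5, and 8 at these points, consistent with Eq. (6.14).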

Fig. 6.18 Comparison of DBF and density


EDF can be easily extended to handle the case when deadlines are different
from the periods. For RM scheduling, the extension is called deadline monotonic
scheduling.


Deadline Monotonic Scheduling 


Explicit-deadline tasks can be dealt with in deadline monotonic (DM) scheduling.
For DM scheduling, static task priorities are based on relative deadlines: for
any two tasks τ_i and τ_j, the priority of τ_i will be higher than that of τ_j if D_i < D_j.

For constrained-deadline tasks, constraint (6.7) can be generalized into constraint
(6.15) which is sufficient, but not necessary [81]:

$$\sum_{i=1}^{n} \frac{C_i}{D_i} \le n \left(2^{1/n} - 1\right) \tag{6.15}$$


6.2.4 Periodic Scheduling with Precedence Constraints 


Scheduling dependent tasks is more difficult than scheduling independent tasks, in
particular in the non-preemptive case ((1 | r_i, prec, periodic | L_max) in the triplet
notation). The problem of deciding whether or not a non-preemptive schedule exists
for a given set of dependent tasks and a given deadline is NP-complete [178]. In
order to reduce the scheduling effort, different strategies are used:


• adding resources such that scheduling becomes easier,

• partitioning of scheduling into static and dynamic parts. With this approach, as
many decisions as possible are taken at design time, and only a minimum of
decisions is left for run-time.


6.2.5 Sporadic Events 


In the case of sporadic events, we could connect sporadic events to interrupts and
execute them immediately if their interrupt priority is the highest in the system.
However, quite unpredictable timing behavior would result for all the other tasks.
Therefore, special sporadic task servers are used which execute at regular intervals
and check for ready sporadic tasks. This way, sporadic tasks are essentially turned
into periodic tasks, thereby improving the predictability of the whole system.


6.3 Scheduling for Independent Jobs on Identical 
Multiprocessors 


Next, we are going to consider multiprocessors, due to their widespread use in the 
form of multi-cores in contemporary embedded systems. A large number of issues 
have to be considered during the transition from uniprocessors to multiprocessors. 
Initially, we assume having m identical processors (or “cores”). Furthermore,
we assume dealing with a task system τ = {τ_1, ..., τ_n} where each task τ_i is
characterized by its worst case execution time (WCET) C_i and, in case of periodic
or sporadic tasks, its period T_i, which is considered to also define the deadline
unless otherwise noted. Whenever the periodic or sporadic nature of tasks is not
relevant, we may also consider a set of jobs with explicit deadlines d_i instead.

For multiprocessors, it is not sufficient to decide when to execute tasks or their
jobs. Rather, we must decide when to execute jobs and where to execute them.
Thus, a one-dimensional problem becomes a two-dimensional problem.

For m identical processors, obvious necessary conditions for schedulability are

$$\forall i: u_i \le 1 \tag{6.16}$$

$$U_{sum} \le m \tag{6.17}$$


6.3.1 Partitioned Scheduling 


Our presentation in the next sections is based predominantly on a book written by 
Baruah et al. [41] and complemented by material from other sources like a survey 
paper by Davis et al. [119] and slides by I. Puaut [461, 462]. Baruah et al. focus on 
sporadic task systems. This is partly motivated by the fact that for such systems— 
in contrast to periodic task systems—no global time synchronization is required 
for releasing jobs. Rather, it is sufficient to maintain a time base which ensures 
that the minimum intervals T_i are kept. Also, sporadic task systems are considered
for complexity reasons. We start by considering sporadic implicit-deadline tasks on
identical multiprocessors. In the triplet notation, this corresponds to the case (Pm |
D_i = T_i, sporadic | ..).

Furthermore, we are initially restricting ourselves to the case of partitioned
scheduling. This means that each task is allocated to a particular processor. Task
migration is not allowed. Partitioned scheduling for synchronous arrival times can
be done by bin packing [306], defined in a notation adjusted for real-time scheduling
as follows:


Definition 6.14 Let τ = {1, ..., n} be a set of items, where each item i ∈ τ has a
size c_i ∈ (0, 1]. Let π = {1, ..., m} be a set of bins with capacity one. The problem
of finding an assignment a : τ → π such that the number of nonempty bins m ≤ n
is minimal and such that allocated sizes do not exceed the bin capacity is called the
bin packing problem.


Bin packing is known to be NP-hard [178]. Hence, optimal algorithms such as the 
one proposed by Korf [305] need large run-times. Formalization of the scheduling 
problem as a bin packing problem aims at the minimization of the number of 
processors m. 

For a given number m of processors, it is more appropriate to model scheduling 
for synchronous arrival times as a knapsack problem, more precisely as a 0/1 
multiple knapsack problem. This problem can be defined as follows, again using 
a notation adjusted for real-time scheduling: 


Definition 6.15 (Martello [367]) Let τ = {1, ..., n} be a set of n items, each with
a size c_i and a benefit b_i. Let π be a set of m knapsacks, each with a capacity κ_k,
with (m ≤ n). Suppose that we can partially allocate a subset of items to knapsacks
(a : τ → π) such that size constraints are respected:

$$\forall k: \sum_{i,\; a: i \mapsto k} c_i \le \kappa_k \tag{6.18}$$

The problem of selecting disjoint subsets of items so that the total profit Σ_i b_i for
items in knapsacks is maximized is called the 0/1 multiple knapsack problem
(MKP).


Given an algorithm for the 0/1 multiple knapsack problem, we can allocate jobs 
to m processors. For identical processors, capacities would all be equal. For uniform 
processors, we can use capacities to take processor speeds into account. The 0/1 
multiple knapsack problem is NP-hard as well. Note that we would possibly not 
schedule all tasks. 

Due to the complexity of scheduling for synchronous arrival times, there is 
no hope for efficient optimal algorithms for the general problem, and in practice, 
heuristics are used. Common heuristics consider tasks and processors in a 
certain sequence and differ in the sequence they use. Lopez et al. [355] have 
compared several heuristics. They restrict themselves to so-called reasonable 
allocation algorithms, defined as follows: 


Definition 6.16 A reasonable allocation (RA) algorithm is defined as one that 
fails to allocate a task to a multiprocessor platform only when the task does not fit 
into any processor upon the platform. 


Definition 6.17 A reasonable allocation decreasing (RAD) algorithm is defined 
as an RA algorithm considering tasks in a nonincreasing order of utilization. 




The algorithms studied by Lopez et al. are obtained by combining two 
characteristics in all possible ways: 


1. The order in which tasks are considered: tasks can be considered in decreasing 
order of utilization (denoted by D), in increasing order of utilization (denoted by 
I), and in arbitrary order (denoted by an empty character). 

2. The search strategy for processor allocation: we consider processors to be 
ordered in some way. Then, the first fit strategy (FF) will allocate the task to the first 
processor on which it fits. The worst fit strategy (WF) will allocate it to the processor 
with the largest remaining capacity. The best fit strategy (BF) will allocate it to the 
processor with the minimum remaining capacity on which it fits. 


There are a total of nine combinations. All combinations can be implemented 
efficiently. For example, algorithm FFD can be detailed as follows: 


Sort task set according to nonincreasing utilizations u_i = C_i / T_i; 
/* Assume task set is renumbered according to the sorting; */ 


for (mt=1; mt <= m; mt++) K[mt] = 1; /* initialize capacity */ 
for (i=1; i <= n; i++) { /* for each task */ 
    for (mt=1; (mt <= m) and (u_i > K[mt]); mt++); /* sufficient capacity? */ 
    if (mt > m) mt=0; /* no solution, use index 0 */ 
    a[i]=mt; /* return processor allocation in array */ 
    if (mt > 0) K[mt]=K[mt]-u_i; /* update remaining capacity */ 
} 
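
The FFD pseudocode can be turned into a short runnable sketch. The following Python version is only an illustration under our own assumptions: the function name first_fit_decreasing, the 0-based processor indices, the use of None for tasks that fit on no processor, and the small tolerance for floating-point comparisons are not part of the algorithm description.

```python
def first_fit_decreasing(utilizations, m):
    """Partition tasks onto m unit-capacity processors with FFD.

    Returns a list alloc where alloc[i] is the processor index (0..m-1)
    chosen for task i, or None if task i fits on no processor.
    """
    capacity = [1.0] * m                  # remaining capacity per processor
    alloc = [None] * len(utilizations)
    # Consider tasks in nonincreasing order of utilization.
    order = sorted(range(len(utilizations)),
                   key=lambda i: utilizations[i], reverse=True)
    for i in order:
        for p in range(m):                # first processor with enough room
            # 1e-12 is an assumed tolerance against rounding errors
            if utilizations[i] <= capacity[p] + 1e-12:
                alloc[i] = p
                capacity[p] -= utilizations[i]
                break
    return alloc
```

Replacing the inner loop by a search for the largest (or smallest sufficient) remaining capacity would yield the WF (or BF) variants; changing the sort order yields the increasing and arbitrary-order variants.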


The heuristic algorithm is certainly not optimal. This raises the question of how 
far we are from the optimum. Many publications discuss upper bounds on the number 
of additional processors needed, if compared to the minimum number of processors 
needed for optimal bin packing. The paper by Dosa [136] is an example of this. 
For real-time systems, a different question is relevant: is there, for a given number 
of processors, any bound on the overall utilization up to which schedulability is 
guaranteed? One utilization bound was proved by Lopez et al. [355]: 


Theorem 6.6 Any reasonable allocation algorithm has a utilization bound no 
smaller than 


$$U_{b1}(U_{\max}) = m - (m - 1)\,U_{\max} \qquad (6.19)$$


Proof When a task with utilization $u_i$ cannot be allocated, every processor must 
have tasks allocated to it with a per-processor utilization exceeding $(1 - u_i)$. The 
overall utilization over all allocated tasks and including $\tau_i$ must then exceed: 


$$m(1 - u_i) + u_i = m - (m - 1)\,u_i \qquad (6.20)$$
$$\ge m - (m - 1)\,U_{\max} \qquad (6.21)$$

This condition must be met for allocation not to be feasible. □ 


Furthermore, define $\beta$ as 


$$\beta = \left\lfloor \frac{1}{U_{\max}} \right\rfloor \qquad (6.22)$$


$\beta$ is a lower bound on the number of tasks of our task set which we can run on 
a single processor. Let us assume that EDF is used for local scheduling on each 
processor. Lopez et al. also showed the following theorem: 


Theorem 6.7 No allocation algorithm can have a utilization bound larger than 


$$U_{b2}(\beta) = \frac{\beta m + 1}{\beta + 1} \qquad (6.23)$$


Proof See Lopez et al. [355]. □ 


Lopez et al. also proved that WF and WFI have Eq. (6.19) as their lower bound; 
the remaining algorithms have Eq. (6.23) as their lower bound. Whenever $U_{\max}$ 
approaches 1, the bound in Eq. (6.19) also approaches 1: 


$$U_{b1}(1) = 1 \qquad (6.24)$$


When $U_{\max}$ gets close to 1, $\beta$ becomes 1, and $U_{b2}$ becomes 


$$U_{b2}(1) = \frac{m + 1}{2} \qquad (6.25)$$

The bound in Eq. (6.25) allows us to use multiple processors in a much more 
efficient way compared to the bound in Eq. (6.24). Hence, with respect to these 
bounds, WF and WFI are inferior to the other seven algorithms. Experimentally, 
it has been shown that FFD seems to be superior to FF or FFI and BFD seems 
to be superior to BF and BFI [41]. There is also some theoretical evidence which 
supports this observation [41]. 
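
To get a feeling for the difference between the two bounds, they can be evaluated for concrete values. The helper below is a small sketch; its name and interface are our own choice, and it simply evaluates Eqs. (6.19), (6.22), and (6.23) for a given number of processors m and a given maximum utilization.

```python
import math

def lopez_bounds(m, u_max):
    """Return (U_b1, U_b2) for m processors and maximum task utilization u_max."""
    beta = math.floor(1.0 / u_max)          # Eq. (6.22)
    ub1 = m - (m - 1) * u_max               # Eq. (6.19)
    ub2 = (beta * m + 1) / (beta + 1)       # Eq. (6.23)
    return ub1, ub2

# Worst case u_max = 1 on four processors: Eqs. (6.24) and (6.25)
print(lopez_bounds(4, 1.0))
```

For four processors and a maximum utilization of 1, the first bound guarantees only a total utilization of 1, whereas the second one guarantees 2.5.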

The nine algorithms sketched above are relatively simple. We refrain from 
presenting more elaborate algorithms for the same problem since the problem 
considered is too simplified to apply to realistic applications: 


• The scheduling problem, as it has been addressed in this section, is a very 
restricted one. There are no precedences, no preemption, and only identical 
processors. 

• Partitioned scheduling may lead to unused processor resources even in situations 
where jobs are available. This means that partitioned scheduling is not 
work-conserving. Therefore, optimality is not guaranteed. 

Hence, the information in this section provides fundamental knowledge, but 
practical applications require more sophisticated approaches, like the ones to be presented 
in the following sections. 


6.3.2 Global Dynamic-Priority Scheduling 


Having unused processors in the presence of available jobs can be avoided with 
global scheduling. For global scheduling, the allocation of processors to tasks 
or jobs is dynamic. This gives us more flexibility, especially in the presence 
of changing workloads or processor availabilities. In the absence of execution 
constraints, upper bounds on the utilization like the ones in Constraints (6.19) 
and (6.23) are replaced by 


$$U_{sum} \le m \qquad (6.26)$$

However, this better utilization bound and flexibility comes at the price of a certain 
overhead for scheduling decisions, preemptions, and job migrations. 


Proportional Fair (Pfair) Scheduling 


The key idea of proportional fair (pfair) scheduling [40] is to execute each task at 
a rate corresponding to its utilization. For example, if $u_i = 0.5$ for a set of tasks, 
then each task should be executed approximately half of the time, regardless of the 
number of processors. For pfair scheduling, we assume that time is quantized and 
enumerated with integers. Also, $C_i$ and $T_i$ parameters are represented by integers. 


Definition 6.18 The lag of a task $\tau_i$ at time $t$ with respect to schedule $S$, denoted as 
$lag(S, \tau_i, t)$, is the difference between the number of slots that the task should 
have received and the number of slots that it has received: 


$$lag(S, \tau_i, t) = u_i \cdot t - \sum_{u=0}^{t-1} alloc(S, \tau_i, u) \qquad (6.27)$$


The first term is the target execution time of task $\tau_i$; the second is the time during 
which this task has been executed in schedule $S$. A schedule is said to be a pfair 
schedule if the lag remains in the interval $(-1, +1)$. 
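
The lag condition is easy to check for a given allocation. The following sketch assumes, purely for illustration, that the allocation of a single task is given as a list of 0/1 values per time slot; the names lag and is_pfair are ours. Exact fractions avoid rounding problems at the interval borders.

```python
from fractions import Fraction

def lag(alloc, C, T, t):
    """Lag at time t for a task with utilization C/T.

    alloc[u] is 1 if the task executes in slot u and 0 otherwise.
    """
    return Fraction(C, T) * t - sum(alloc[:t])

def is_pfair(alloc, C, T):
    """True if the lag stays strictly inside (-1, +1) at all slot boundaries."""
    return all(-1 < lag(alloc, C, T, t) < 1 for t in range(len(alloc) + 1))
```

For a task with C = 1 and T = 2, the allocation [1, 0, 1, 0] keeps the lag between -1/2 and 0, whereas [0, 0, 1, 1] reaches a lag of 1 at time 2 and is therefore not pfair.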


The presentation of pfair scheduling is based on slides by I. Puaut [462]. 
Fig. 6.20 Intervals for allocated execution time 


Example 6.10 Figure 6.19 shows the executed time as a function of real time. 
The amount of executed time should not reach the two dashed lines. 


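For pfair scheduling, we divide each task $\tau_i$ into subtasks $\tau_i^j$, where $j$ enumerates 
the execution intervals. For each subtask, we define a pseudo-release time and a 
pseudo-deadline: 


$$r(\tau_i^j) = \left\lfloor \frac{j - 1}{u_i} \right\rfloor \qquad (6.28)$$
$$d(\tau_i^j) = \left\lceil \frac{j}{u_i} \right\rceil \qquad (6.29)$$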


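Example 6.11 Consider a task $\tau_i$ with $C_i = 8$ and $T_i = 11$. Possible intervals for 
the number of allocated execution slots for each $j$ are shown in Fig. 6.20. 


For example: 

$$r(\tau_i^6) = \left\lfloor \frac{6 - 1}{8/11} \right\rfloor = \left\lfloor \frac{55}{8} \right\rfloor = 6$$
$$d(\tau_i^6) = \left\lceil \frac{6}{8/11} \right\rceil = \left\lceil \frac{66}{8} \right\rceil = 9$$


Hence, the sixth subtask of task $\tau_i$ must be executed in time interval (6:9). ∇ 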
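
The windows of all subtasks can be computed in the same way. The function below is a minimal sketch of Eqs. (6.28) and (6.29); the name subtask_window is our own, and exact fractions are used so that floor and ceiling are not affected by rounding.

```python
from fractions import Fraction
import math

def subtask_window(C, T, j):
    """Pseudo-release time and pseudo-deadline of the j-th subtask (j >= 1)."""
    u = Fraction(C, T)                 # utilization of the task
    release = math.floor((j - 1) / u)  # Eq. (6.28)
    deadline = math.ceil(j / u)        # Eq. (6.29)
    return release, deadline

# All eight subtasks of the task with C = 8 and T = 11
for j in range(1, 9):
    print(j, subtask_window(8, 11, j))
```

For j = 6, the function returns the interval computed by hand in the example.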

One particular approach for allocation of a correct number of execution slots is 
presented in the book by Baruah et al. [41]. In general, there are variations of this 
scheme: we can apply EDF to pseudo-deadlines, or we can modify EDF by defining 
rules which are applied in case of ties. It is feasible to obtain schedulability for full 
processor utilization, i.e., for $U_{sum} \le m$. 

Pfair scheduling potentially suffers from a large number of migrations between 
processors. Also, due to the integer (over-)approximation of execution times, it is 
not work-conserving. Variants have been proposed which reduce the overhead for 
job migrations. Also, the overall complexity can be reduced with some variants. 

Pfair scheduling finds many applications in operating systems, for example, for 
scheduling virtual machines. 


6.3.3 Global Fixed-Job-Priority Scheduling 
G-EDF Scheduling 


We can also try to solve the two-dimensional problem with extensions of uniproces- 
sor scheduling algorithms. For example, we could use global EDF (G-EDF). G-EDF, 
just like EDF, defines job priorities based on the closeness of the next deadlines. 
If m processors are available, those m jobs having the highest priorities among 
all available jobs are executed. Obviously, such priorities are job-dependent and 
not just task-dependent. In a global scheduling strategy, we would like to keep 
preemptions and task migrations to a minimum. For G-EDF, these numbers depend 
on how we allocate tasks/jobs to a particular processor [189]. 


Lemma 6.2 G-EDF is not optimal. 


Proof The proof is by counterexample, adapted from Cho et al. [102]. Suppose 
$m = 2$ and $C_1 = 3$, $D_1 = 4$, $C_2 = 2$, $D_2 = 3$, $C_3 = 2$, and $D_3 = 3$. As shown 
in Fig. 6.21 (left), G-EDF schedules $J_2$ and $J_3$ first, due to their earlier deadline. $J_1$ 
misses its deadline. However, a schedule is feasible, as shown in Fig. 6.21 (right). □ 

Fig. 6.21 Left, G-EDF violates deadline at t = 4; right, feasible schedule 

Fig. 6.22 Dhall effect 


Obviously, the problem for G-EDF results from not being able to use the second 
processor for t > 2. 

In general, G-EDF may suffer anomalies like the so-called Dhall effect [130]: 
periodic task sets for which one task has a utilization close to one cannot be 
scheduled with G-EDF. 


Example 6.12 To demonstrate the effect, let us consider the case of $n = m + 1$ and 


$$\forall i \in [1..m]: T_i = 1,\; C_i = 2\varepsilon,\; u_i = 2\varepsilon \qquad (6.30)$$

$$T_{m+1} = 1 + \varepsilon,\; C_{m+1} = 1,\; u_{m+1} = \frac{1}{1 + \varepsilon} \qquad (6.31)$$

A corresponding schedule is shown in Fig. 6.22. Initially, only tasks $\tau_1, \ldots, \tau_m$ 
are executed. The execution of task $\tau_{m+1}$ starts only after the first $m$ tasks have 
completed their execution, and it will miss its deadline. The presence of a single 
task $\tau_{m+1}$ with a high utilization is sufficient to cause a deadline miss at $t = 1 + \varepsilon$. 
This happens even though the utilization of the other tasks is very small. In fact, the 
utilization of tasks $\tau_1, \ldots, \tau_m$ can be arbitrarily small, and we will still have a deadline 
miss. ∇ 
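
The numbers behind this example can be checked with a few lines of code. The sketch below does not simulate G-EDF; it only evaluates the task parameters of the example and the finishing time that follows from the schedule described above (the heavy task can start only at time 2ε). The function name and the returned tuple are our own choice.

```python
from fractions import Fraction

def dhall_example(m, eps):
    """Return (total utilization, True if the heavy task misses its deadline)."""
    u_sum = m * 2 * eps + 1 / (1 + eps)   # m light tasks plus one heavy task
    finish_heavy = 2 * eps + 1            # starts at 2*eps, executes for 1
    deadline_heavy = 1 + eps
    return u_sum, finish_heavy > deadline_heavy

u_sum, missed = dhall_example(100, Fraction(1, 1000))
print(float(u_sum), missed)
```

Even with 100 processors, a total utilization of about 1.2 is sufficient to cause a deadline miss.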


This motivates using variants of algorithms which assign high priorities to tasks 
with a high utilization, regardless of their deadline or period. 

Algorithm fpEDF is such an algorithm. We assume that we are given an 
implicit-deadline sporadic task system $\tau = \{\tau_1, \ldots, \tau_n\}$ and that tasks are ordered by 
nonincreasing utilizations $u_i$. Our goal is to schedule these tasks on $m$ identical 
processors while avoiding the Dhall effect. Algorithm fpEDF works as follows [41]: 


for (i=1; i <= m-1; i++) { 
    if (u_i > 0.5) τ_i's jobs obtain highest priority (ties broken arbitrarily) 
    else break; 
} /* Remaining jobs get priorities according to EDF. */ 


This means that the $m - 1$ tasks of largest utilization will obtain the highest 
priority if their utilization exceeds a value of 0.5. 
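
The priority assignment of fpEDF can be sketched in a few lines. The function below only splits the task set into the tasks whose jobs get the highest priority and the tasks whose jobs are scheduled by EDF; the run-time scheduler itself is not shown. The function name and the representation of tasks by their index in a list of utilizations are assumptions made for this illustration.

```python
def fpedf_priority_classes(utilizations, m):
    """Return (top, rest): task indices with fixed highest priority, and the others."""
    order = sorted(range(len(utilizations)),
                   key=lambda i: utilizations[i], reverse=True)
    top = []
    for i in order[:m - 1]:            # at most m-1 tasks are promoted
        if utilizations[i] > 0.5:
            top.append(i)
        else:
            break                      # all remaining tasks are lighter
    rest = [i for i in order if i not in top]   # scheduled by EDF
    return top, rest
```

For utilizations [0.2, 0.9, 0.6, 0.55] and m = 3, the tasks with utilizations 0.9 and 0.6 are promoted, whereas the task with utilization 0.55 is not, since only m - 1 = 2 tasks can be promoted.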

Theorem 6.8 Algorithm fpEDF has a utilization bound no smaller than $\frac{m+1}{2}$. 


This is the best bound which we can expect unless some additional information is 
known, as is evident from the following theorem. 

Fig. 6.23 G-EDF: left, missed deadlines; right, ZL improvement 


Theorem 6.9 No $m$-processor fixed-job-priority scheduling algorithm has a schedulable 
utilization greater than $\frac{m+1}{2}$. 


The proofs of both theorems can be found in [41]. As in the case of partitioned 
scheduling, stronger bounds are feasible if the largest utilization is known. 

A similar idea is used in scheduling algorithm EDF(k): for EDF(k), k tasks of 
highest utilization obtain the highest priority, breaking ties arbitrarily. All other tasks 
are scheduled according to EDF. 


Theorem 6.10 EDF(k) will schedule $\tau$ on $m$ unit-speed (homogeneous) processors, 
where $\tau$ is an implicit-deadline sporadic task system, 


$$m = (k - 1) + \left\lceil \frac{U(\tau^{(k+1)})}{1 - u_k} \right\rceil \qquad (6.32)$$


and $U(\tau^{(k+1)})$ is the utilization for the task set with the first $k$ tasks removed. 


The proof of this theorem can again be found in [41]. 


EDZL Scheduling 


Obviously, G-EDF can miss deadlines for task sets that are schedulable. We can 
improve G-EDF by adding a consideration of laxity: the EDZL algorithm applies 
G-EDF as long as the laxity of jobs is greater than zero (see [41, Chapter 20]). 
However, whenever the laxity of a job becomes zero, the job gets the highest priority 
among all jobs, even including currently executing jobs. 


Example 6.13 Consider the example in Fig. 6.23, adapted from Puaut [461]. In this 
example, parameters are as follows: $n = 3$, $m = 2$, $T_1 = T_2 = T_3 = 3$, and 
$C_1 = C_2 = C_3 = 2$. For this example, G-EDF misses the deadlines for $\tau_3$ at times 
$t = 3k$ for $k = 1, 2, 3, \ldots$, as can be seen in Fig. 6.23 (left). However, EDZL keeps 
the deadlines as can be seen in Fig. 6.23 (right). The detailed behavior depends 
somewhat on the processor allocation used by EDZL. ∇ 
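
The behavior in this example can be reproduced with a small discrete-time simulation. The sketch below is not taken from the cited sources and rests on several simplifying assumptions of ours: all tasks are periodic with implicit deadlines and are released synchronously at time 0, all parameters are integers, scheduling decisions are taken once per time unit, ties are broken by task index, and a job that misses its deadline is dropped. Task indices are 0-based, so index 2 corresponds to the third task.

```python
def simulate(tasks, m, horizon, zero_laxity=False):
    """Simulate G-EDF (or EDZL if zero_laxity is True) on m processors.

    tasks: list of (C, T) pairs. Returns a list of (task index, time) misses.
    """
    n = len(tasks)
    remaining = [0] * n                # remaining execution of current job
    deadline = [0] * n                 # absolute deadline of current job
    misses = []
    for t in range(horizon + 1):
        for i, (C, T) in enumerate(tasks):
            if t % T == 0:             # deadline of old job, release of new job
                if remaining[i] > 0:
                    misses.append((i, t))
                remaining[i] = C
                deadline[i] = t + T
        if t == horizon:
            break

        def priority(i):
            laxity = deadline[i] - t - remaining[i]
            urgent = zero_laxity and laxity <= 0
            return (0 if urgent else 1, deadline[i], i)

        ready = [i for i in range(n) if remaining[i] > 0]
        for i in sorted(ready, key=priority)[:m]:
            remaining[i] -= 1          # execute for one time unit
    return misses

tasks = [(2, 3), (2, 3), (2, 3)]
print(simulate(tasks, 2, 6))                    # G-EDF
print(simulate(tasks, 2, 6, zero_laxity=True))  # EDZL
```

Under these assumptions, G-EDF reports deadline misses of the third task at times 3 and 6, whereas EDZL reports none.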

Fig. 6.24 Schedule generated by G-RM 


EDZL is strictly superior to EDF, as shown by Choi et al. [101]. Informally, 
this can be shown as follows:⁷ suppose that $S$ is a schedule from EDF, and $S'$ is a 
schedule from EDZL for the same input task set. If a job at time $t$ is scheduled in 
EDZL but not in EDF, then the job misses the deadline in EDF but not in EDZL. If 
both schedule the job, then the schedule remains the same. That is, the first moment 
when $S$ differs from $S'$ has the following results: 


• either EDZL remains feasible but EDF becomes infeasible 
• or both EDZL and EDF are infeasible. 


Therefore, EDZL is superior to EDF. Piao et al. [452] proved the following 
utilization bound for EDZL: 


$$U_{sum} \le \frac{m + 1}{2} \qquad (6.33)$$


6.3.4 Global Fixed-Task-Priority Scheduling 
Global Rate Monotonic Scheduling 


In a similar way, we can extend rate monotonic scheduling to global rate monotonic 
scheduling (G-RM). For G-RM, there is an anomaly concerning relaxed schedules: 


Lemma 6.3 For G-RM, there may be situations in which schedules exist for a 
certain task system, but deadlines are violated if periods are extended. 


Proof We prove the existence of such situations by means of an example, adapted 
from Puaut [461]. Consider the case $m = 2$, $n = 3$, $T_1 = 3$, $C_1 = 2$, $T_2 = 4$, 
$C_2 = 2$, $T_3 = 12$, and $C_3 = 7$. Figure 6.24 shows a schedule generated by G-RM. 
If we extend the period of $\tau_1$ to $T_1 = 4$, $\tau_3$ will miss its deadline (see Fig. 6.25). 


⁷I owe this informal explanation to J.J. Chen, TU Dortmund. 

Fig. 6.25 Schedule with a missed deadline at t = 12 generated by G-RM 


This counterintuitive result makes the design of proofs and examples much more 
complex, compared to the uniprocessor case. □ 


The critical instant theorem for uniprocessors (see p. 314) is also not valid for 
multi-core systems. 

The following utilization bound has been shown for G-RM:⁸ 


Theorem 6.11 Any implicit-deadline periodic or sporadic task system $\tau$ satisfying 

$$U_{sum} \le \frac{m}{2}\,\bigl(1 - U_{\max}(\tau)\bigr) + U_{\max}(\tau) \qquad (6.34)$$


is successfully scheduled by G-RM on $m$ unit-speed (homogeneous) processors [50]. 


G-RM also suffers from the Dhall effect: note that the bound on $U_{sum}$ in Eq. (6.34) 
approaches one, independent of $m$, as $U_{\max}$ approaches one. Also, like G-EDF, the 
algorithm cannot fully exploit the presence of multiple processors. 
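
The bound of Theorem 6.11 leads to a simple sufficient schedulability test. The function below is a sketch based on our reading of Eq. (6.34) as (m/2)(1 - U_max) + U_max; the function name is ours. A negative result does not prove that the task set is unschedulable, since the test is only sufficient.

```python
def grm_sufficient(utilizations, m):
    """True if the task set passes the utilization test of Eq. (6.34)."""
    u_sum = sum(utilizations)
    u_max = max(utilizations)
    return u_sum <= m / 2 * (1 - u_max) + u_max

print(grm_sufficient([0.2, 0.2, 0.2], 2))   # light tasks only
print(grm_sufficient([0.9, 0.3], 2))        # one heavy task
```

The second call fails the test although the total utilization is far below the number of processors, which reflects the influence of a single heavy task on the bound.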

Therefore, algorithm RM-US($\xi$) with threshold $\xi$ has been proposed, where US 
stands for utilization threshold. Given an implicit-deadline sporadic task system 
$\tau = \{\tau_1, \ldots, \tau_n\}$ and tasks ordered by nonincreasing utilizations $u_i$, the goal is to schedule 
up to $m - 1$ high-utilization tasks on $m - 1$ identical processors while avoiding 
the Dhall effect, leaving at least one processor for the remaining tasks. RM-US($\xi$) 
works as follows: 


for (i=1; i <= m-1; i++) { 
    if (u_i > ξ) τ_i is assigned highest priority 
    else break; 
} /* remaining tasks are assigned priorities according to G-RM */ 


Theorem 6.12 Algorithm RM-US($\xi$) has a utilization bound no smaller than 
$\frac{m^2}{3m - 2}$ upon $m$ unit-speed processors. 

The proof was published by Andersson et al. [16]. For $3m \gg 2$, this bound 
approaches $\frac{m}{3}$. A tighter bound was shown by Chen et al. [97]. 
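
The resulting priority order can be sketched as follows. The function below is an illustration only: its name, the representation of tasks as (C, T) pairs, and the returned list of task indices are our assumptions. The threshold is passed as a parameter; the value m/(3m - 2) used in the example call is the threshold associated with the bound of Andersson et al. and is stated here as an assumption, since the text above leaves the threshold open.

```python
def rm_us_priorities(tasks, m, xi):
    """Return task indices ordered from highest to lowest priority.

    tasks: list of (C, T) pairs; xi: utilization threshold.
    """
    util = [C / T for C, T in tasks]
    by_util = sorted(range(len(tasks)), key=lambda i: util[i], reverse=True)
    top = []
    for i in by_util[:m - 1]:          # at most m-1 heavy tasks are promoted
        if util[i] > xi:
            top.append(i)
        else:
            break
    # Remaining tasks: rate monotonic order (shorter period first)
    rest = sorted((i for i in range(len(tasks)) if i not in top),
                  key=lambda i: tasks[i][1])
    return top + rest

m = 2
print(rm_us_priorities([(1, 10), (9, 10), (2, 5), (1, 4)], m, m / (3 * m - 2)))
```

In this call, the task with utilization 0.9 obtains the highest priority although it has the longest period; the other tasks follow in rate monotonic order.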


⁸A tighter bound has been shown by Chen et al. [97]. 

RMZL Scheduling 


G-RM might miss deadlines for task sets that are schedulable, and we can consider 
improvements. One such improvement is RMZL scheduling. For RMZL scheduling, 
we use (G-)RM scheduling as long as the current laxity is larger than zero. However, 
when the laxity becomes zero for one of the jobs, we raise its priority to the highest. 
RMZL scheduling is superior to RM scheduling, since schedules are changed only 
when RM scheduling could have missed a deadline [41]. 


Partitioned Scheduling for Explicit Deadlines 


Partitioned scheduling for explicit-deadline task systems can be done similar to 
partitioned scheduling for implicit-deadline task systems by replacing sorting by 
utilization with sorting by density. However, this approach is not recommended, 
since density can be unbounded in certain cases. Baruah et al. present a better 
approach for partitioned scheduling [41]. 


6.4 Dependent Jobs on Homogeneous Multiprocessors 


Results presented in the previous section constitute fundamental basic knowledge, 
but the restriction to independent tasks and identical processors inhibits their appli- 
cation for many design problems. Next, we will be dropping these restrictions. First 
of all, we will be dropping the restriction to independent tasks and focus on some 
simple algorithms used in the design automation community. For example, as-soon- 
as-possible (ASAP), as-late-as-possible (ALAP), and list (LS) and force-directed 
scheduling (FDS) are very popular for automated synthesis from algorithmic design 
descriptions, the so-called high-level synthesis (HLS) (see, for example, Coussy 
[113]). 


6.4.1 As-Soon-as-Possible Scheduling 


Considering precedence constraints, as-soon-as-possible (ASAP) scheduling tries 
to schedule each task as soon as feasible. ASAP scheduling, as used in HLS, 
considers a mapping of tasks to integer start times: $S: \tau \to \mathbb{N}_0$. Allocation to 
specific processors has to be performed after ASAP scheduling. Preemptions are 
not allowed. 

We assume that the execution times of all tasks are known and that they are 
independent of the processor executing the tasks. Hence, we are assuming that 
processors are homogeneous. The algorithm does not consider any constraints on 
the number of processors and assumes that the number of processors needed for the 
resulting schedule is available. The ASAP algorithm works as follows: 


for (t=0; there are unscheduled tasks; t++) { 
    τ' = {all unscheduled tasks for which all predecessors have finished}; 
    set start time of all tasks in τ' to t; 
} 
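
Instead of stepping through all time values, the same start times can be obtained by a longest-path computation over the task graph. The following sketch uses this equivalent formulation; the function name, the dictionary-based graph representation, and the small task graph in the usage example are hypothetical and do not correspond to Fig. 6.26.

```python
def asap(exec_time, preds):
    """ASAP start times for a task graph.

    exec_time: dict task -> execution time
    preds:     dict task -> list of predecessor tasks
    """
    start = {}

    def visit(task):
        # A task starts when the last of its predecessors has finished.
        if task not in start:
            start[task] = max((visit(p) + exec_time[p]
                               for p in preds.get(task, [])), default=0)
        return start[task]

    for task in exec_time:
        visit(task)
    return start

exec_time = {"a": 2, "b": 3, "c": 1, "d": 4}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(asap(exec_time, preds))
```

Each task is visited once and each edge is inspected once, which matches the remark that ASAP scheduling can be implemented with linear complexity.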


Example 6.14 Let us assume that the task graph of Fig. 6.26 (left) is given. 

Each node labeled $i$ represents a task $\tau_i$. Furthermore, let us assume that 
execution times correspond to those listed in Fig. 6.26 (right). 

Then, ASAP scheduling will generate the schedule shown in Fig. 6.27. Numbers 
in blue denote start times; numbers in green denote finish times. Tasks $\tau_2$ to $\tau_6$ 
all start immediately after task $\tau_1$ has finished, since they do not depend on any 
other task. Also, tasks $\tau_7$ to $\tau_9$ start as soon as the last of their predecessors has 
finished, and the same holds for task $\tau_{10}$. The red line in Fig. 6.27 (right) shows that 
a maximum of five processors is needed, since ASAP scheduling does not consider 
any constraints on the number of processors. ∇ 

Fig. 6.26 Left, task graph; right, execution times of tasks 


Fig. 6.27 Left: ASAP scheduled task graph; right: time line 

ASAP scheduling minimizes the makespan, since all tasks are scheduled as early 
as possible. The presented algorithm could be extended to also cover real numbers 
as execution times. We may consider ASAP scheduling to be of linear complexity, 
provided that we use a clever technique for computing t’. The algorithm can also 
be applied to personal life, corresponding to a situation where each person is eager 
to perform available work as early as possible. 


6.4.2 As-Late-as-Possible Scheduling 


As-late-as-possible (ALAP) scheduling is the second simple scheduling algorithm 
for dependent tasks. For ALAP scheduling, all tasks are started as late as possible. 
The algorithm works as follows: 


for (t=0; there are unscheduled tasks; t--) { 
    τ' = {all unscheduled tasks on which no unscheduled task depends 
          and which have to finish at time t at the latest}; 
    set start time of all tasks in τ' to (t - their execution time); 
} 
Shift all times such that the first tasks start at t=0. 


The algorithm starts with tasks on which no other task depends. These tasks are 
assumed to finish at time 0. Their start time is then computed from their execution 
time. The loop then iterates backward over time steps. Whenever we reach a time 
step, at which a task should finish the latest, its start time is computed, and the task 
is scheduled. After finishing the loop, all times are shifted toward positive times 
such that the first task starts at time 0. We could also consider ALAP scheduling as 
a case of ASAP scheduling starting at the “other” end of the graph. 
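
The view of ALAP scheduling as ASAP scheduling from the other end of the graph leads directly to an implementation. The sketch below computes latest finish times backward from the tasks without successors and then shifts all times; as before, the function name, the data representation, and the example graph are hypothetical.

```python
def alap(exec_time, preds):
    """ALAP start times, shifted such that the earliest start is 0."""
    succs = {t: [] for t in exec_time}
    for task, ps in preds.items():
        for p in ps:
            succs[p].append(task)
    finish = {}

    def latest_finish(task):
        # A task must finish before the earliest start of its successors;
        # tasks without successors finish at time 0.
        if task not in finish:
            finish[task] = min((latest_finish(s) - exec_time[s]
                                for s in succs[task]), default=0)
        return finish[task]

    start = {t: latest_finish(t) - exec_time[t] for t in exec_time}
    shift = -min(start.values())       # move the first start to time 0
    return {t: s + shift for t, s in start.items()}

exec_time = {"a": 2, "b": 3, "c": 1, "d": 4}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(alap(exec_time, preds))
```

In this small graph, task c starts at time 4 in the ALAP schedule and at time 2 in the ASAP schedule, while a, b, and d start at the same times in both schedules.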


Example 6.15 For the task graph in Fig. 6.26, ALAP scheduling would generate the 
result shown in Fig. 6.28. The color coding is the same as for the ASAP example. 
Note that each task finishes as late as possible. In particular, tasks $\tau_7$ to $\tau_9$ finish only 
at time 34. Tasks $\tau_4$ to $\tau_6$ finish later than for the ASAP schedule. Tasks $\tau_1$, $\tau_2$, $\tau_9$, 
and $\tau_{10}$ are scheduled as in the ASAP schedule, since these tasks determine the 
makespan. Tasks which determine the makespan are said to be on the critical path. 
Five processors are needed, as indicated by the red line. ∇ 


This scheduling strategy can also be applied to personal life. It corresponds to a 
situation where each person (is lazy and) finishes tasks as late as possible. Many 
processors are needed if the task graph is very wide at its lower end.⁹ 


⁹This corresponds to a lot of work in the final phase if people start lazy. 

Fig. 6.29 Running example, left: mobility; right, number of successors 


6.4.3 List Scheduling 


With list scheduling (LS), we try to maintain the low complexity of ASAP and 
ALAP scheduling while making the algorithm aware of available processors. 
Processors may be of different types, but we do still assume that there is a 
one-to-one mapping between tasks and processor types. Hence, processors may 
be heterogeneous, but the crucial mapping from tasks to processor types is not 
generated by list scheduling. 

We assume that we have a set $L$ of processor types. List scheduling respects 
upper bounds $B_l$ on the number of processors for each type $l \in L$. 

List scheduling requires the availability of a priority function reflecting the 
urgency of scheduling a certain task $\tau_i$. The following urgency metrics are in use 
[528]: 


• Mobility is defined as the difference between the start times for the ASAP and 
ALAP schedule. Figure 6.29 (left) shows the mobility for our running example 
in red. Obviously, scheduling is urgent for the four tasks on the critical path for 
which mobility is zero. 

• The number of nodes below task $\tau_i$ in the tree (see Fig. 6.29 (right)). 

• The path length for a task $\tau_i$ is defined as the length of the path from starting 
at $\tau_i$ to finishing the entire graph $G$. The path length is typically weighted by 
the execution time associated with the nodes, assuming that this information is 
known. In Fig. 6.30 (left), path lengths have been added. 

Fig. 6.30 Left, task graph with path lengths; right, time line for path length based list scheduling 


List scheduling requires the knowledge of the task graph $G = (\tau, E)$ to be 
scheduled, a mapping from each node of the graph to the corresponding resource 
type $l \in L$, an upper bound $B_l$ for each $l$, a priority function (as just explained), and 
the execution time for each task $\tau_i \in \tau$. List scheduling then fits nodes of maximum 
priority into each of the time steps such that the constraints are not violated [528]: 


for (t=0; there are unscheduled tasks; t++) /* loop over times */ 
    for (l ∈ L) { /* loop over resource types */ 
        τ^run_{t,l} = set of tasks of type l still executing at time t; 
        τ^ready_{t,l} = set of tasks of type l ready to start execution at time t; 
        Compute set τ'_t ⊆ τ^ready_{t,l} of maximum priority such that 
            |τ'_t| + |τ^run_{t,l}| ≤ B_l; 
        Set start times of all τ_j ∈ τ'_t to t: s_j = t; 
    } 
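
For a single processor type, list scheduling can be sketched in a few lines. The code below is an illustration under simplifying assumptions of ours: there is only one resource type with an upper bound called bound, execution times are positive integers, and priorities are given as a dictionary (for example, path lengths). The function name and the example graph are hypothetical.

```python
def list_schedule(exec_time, preds, priority, bound):
    """List scheduling for one resource type with at most `bound` processors.

    Returns a dict task -> start time.
    """
    start, finish = {}, {}
    t = 0
    while len(start) < len(exec_time):
        running = [x for x in start if finish[x] > t]
        ready = [x for x in exec_time
                 if x not in start
                 and all(p in finish and finish[p] <= t
                         for p in preds.get(x, []))]
        ready.sort(key=lambda x: priority[x], reverse=True)
        free = max(0, bound - len(running))
        for x in ready[:free]:         # highest priority tasks first
            start[x] = t
            finish[x] = t + exec_time[x]
        t += 1
    return start

exec_time = {"a": 2, "b": 3, "c": 1, "d": 4}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
path_length = {"a": 9, "b": 7, "c": 5, "d": 4}
print(list_schedule(exec_time, preds, path_length, 1))
print(list_schedule(exec_time, preds, path_length, 2))
```

With two processors, the result is identical to the ASAP schedule of this graph; with a single processor, task b is preferred over task c because of its longer path length.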


Example 6.16 Figure 6.30 shows the result of list scheduling as applied to our 
example in Fig. 6.26, using path length as priority. We assume that all processors 
are of the same type and that we allow no more than three processors ($B_l = 3$). 
At time 9, tasks $\tau_2$, $\tau_4$, and $\tau_5$ have the longest path length and hence the highest 
priority. $\tau_4$ finishes at time 17, and $\tau_3$ and $\tau_6$ have the longest path length of the 
remaining tasks. We assume that we schedule $\tau_3$. At time 19, $\tau_5$ finishes and $\tau_6$ can 
be started. At time 28, $\tau_3$ and $\tau_6$ finish, freeing processors for $\tau_7$ and $\tau_8$. $\tau_7$ finishes 
at time 35, enabling dependent task $\tau_{10}$ to start and to finish at time 42, only slightly 
later than in the ASAP and ALAP schedules, despite using only three processors. ∇ 


LS—like ASAP and ALAP scheduling—does not allocate tasks to processors, 
but there is also no need for doing this for the restricted resource model. LS can also 
be extended to real numbers as execution times. The algorithm typically generates 
good results and is easy to adapt to various scenarios. These two features make LS 
a very popular scheduling algorithm for tasks with precedences. 

Force-directed scheduling (FDS) is another heuristic scheduling algorithm for 
dependent tasks. FDS aims at an efficient use of processors. It tries to balance the 
number of processors that may be needed at any particular time [449]. 


6.4.4 Optimal Scheduling with Integer Linear Programming 


Next, we will be describing an approach for mapping tasks to multiple processors for 
which decisions are taken on a more global view of the design problem. It is based 
on integer linear programming (ILP) (see Appendix A). In this way, constraints and 
optimization goals are made explicit. We are adopting material from a publication 
of Coscun et al. [112] in our presentation. 
ILP models consist of a linear cost function and a set of linear constraints. We 
will use the following variables in these two parts of the model: 

$x_{i,k}$: = 1 if task $\tau_i$ is executed on processor $\pi_k$ and = 0 otherwise 
$s_i$: start time of task $\tau_i$ 
$f_i$: finish time of task $\tau_i$ 
$C_i$: execution time of task $\tau_i$ 
$b_{i,j}$: = 1 if task $\tau_i$ is executed before $\tau_j$ on the same processor, else = 0 

Let us assume that our task graph $G = (\tau, E)$ has a common exit node $\tau_{exit}$. If no 
such node is initially present, we add a virtual node. The finish time of this node is 
equivalent to the makespan $MS_{max}$. We can use this finish time as our cost function 
to be minimized. Hence, the objective of ILP minimization can be expressed as: 


$$\min(f_{exit}) \qquad (6.35)$$


First, the set of constraints ensures that each task is executed on some processor: 


$$\forall \tau_i \in \tau: \sum_{k \in \{1..m\}} x_{i,k} = 1 \qquad (6.36)$$


Second, the different times are related by the following equations: 

$$\forall \tau_i \in \tau: f_i = s_i + C_i \qquad (6.37)$$

Third, in order to respect precedence relations, the following equations can be used: 

$$\forall (\tau_i, \tau_j) \in E: s_j - f_i \ge 0 \qquad (6.38)$$


Fourth, in a single core, execution is in a sequence as determined by variable $b_{i,j}$: 

$$\forall (\tau_i, \tau_j): f_i \le s_j \text{ if } b_{i,j} = 1 \qquad (6.39)$$


Fifth, each processor can execute only a single task at a time: 

$$\forall (\tau_i, \tau_j): b_{i,j} + b_{j,i} = 1 \text{ if } \exists \pi_k: x_{i,k} = x_{j,k} = 1 \qquad (6.40)$$



Equations (6.39) and (6.40) can be turned into the linear form required for ILP [112]. 
The resulting ILP model can be fed into some available ILP solver. ILP 
models have the advantage of precisely modeling the design problem and the 
objectives. They enable optimizations from a global viewpoint, using mathematical 
optimization techniques and stepping away from imperative programming. 
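
The conditional constraints (6.39) and (6.40) are not linear as written. One common way of linearizing such conditions is the so-called big-M technique, sketched below. This is a generic formulation and not necessarily the one used in [112]; the constant $M$ is an assumption and has to be chosen larger than any feasible finish time.

```latex
% Big-M linearization of Eqs. (6.39) and (6.40); M is an assumed large constant
\begin{align*}
  s_j &\ge f_i - M\,(1 - b_{i,j})              && \forall i \ne j \\
  b_{i,j} + b_{j,i} &\ge x_{i,k} + x_{j,k} - 1 && \forall i < j,\ \forall k \\
  b_{i,j} + b_{j,i} &\le 1                     && \forall i < j
\end{align*}
```

If $b_{i,j} = 1$, the first inequality enforces $s_j \ge f_i$; otherwise, it is trivially satisfied. If both tasks are mapped to the same processor $\pi_k$, the last two inequalities force exactly one of $b_{i,j}$ and $b_{j,i}$ to be 1.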

The ILP problem is NP-hard. Therefore, run-times of ILP solvers can become 
large, but there has been significant progress in the design of ILP solvers. Hence, 
moderately large problems can be solved in acceptable times. However, due to 
the complexity of ILP, these approaches do not scale to really large designs, and 
run-times may be unacceptable. Nevertheless, these models can be used for exact 
optimization of moderately large design problems and serve as a good starting point 
for heuristics for larger problems. 


6.5 Dependent Jobs on Heterogeneous Multiprocessors 


6.5.1 Problem Description 


Next to dropping the restriction to independent tasks, we would like to drop the 
restriction to homogeneous processors. We assume that the processing speeds of 
processors of our execution platform $\pi = \{\pi_1, \ldots, \pi_m\}$ are unrelated. According 
to Pinedo's triplet notation, we are considering the case (Rm | $r_i$, prec, ... | ...) 
including platforms comprising a mixture of execution units, like FPGAs, GPUs, 
etc. 

The theory of the resulting scheduling problems has not been studied com- 
prehensively. As a result, Baruah et al. [41] state (in Chapter 22): “although 
unrelated multiprocessors are becoming increasingly more important in real-time 
systems implementation, the resulting scheduling theoretic study of such systems is, 
relatively speaking, still in its infancy.” Some first results are presented in the book 
by Baruah, but we resort to presenting methods published in the design automation 
community. They can handle realistic design tasks, sacrificing proofs of optimality. 


6.5.2 Static Scheduling with Local Heuristics 


We will now describe the heterogeneous-earliest-finish-time (HEFT) and the 
critical-path-on-a-processor (CPOP) algorithms for static scheduling of tasks 
in a task graph $G = (\tau, E)$ onto a heterogeneous multiprocessor system 
$\pi = \{\pi_1, \ldots, \pi_m\}$ [545]. These two algorithms are standard examples of fast 
algorithms. In a way, they extend ASAP and ALAP scheduling for heterogeneous 
processors. This is the notation we need: 


• We assume that the task graph has a common entry node $\tau_{entry}$. If no such node 
  is initially present, we will add an artificial node having zero execution time and 
  communication bandwidth requirements. 

• We assume that the task graph has a common exit node $\tau_{exit}$. If no such node is 
  initially present, we will add an artificial node having zero execution time and 
  communication bandwidth requirements. 

• Matrix $C = (c_{i,k})$ denotes the execution time of task $\tau_i$ on processor $\pi_k$. 

• Matrix $B = (b_{k,l})$ denotes the communication bandwidth for communication 
  from processor $\pi_k$ to processor $\pi_l$. 

• Matrix $data = (data_{i,j})$ represents the amount of data which must be 
  transmitted from task $\tau_i$ to task $\tau_j$. 

• Vector $\kappa = (\kappa_k)$ contains the communication startup costs on processor $\pi_k$. 

• Matrix $H = (h_{i,j,k,l})$ describes the communication cost from task $\tau_i$ to task $\tau_j$ 
  under the assumption that $\tau_i$ is mapped to processor $\pi_k$, and task $\tau_j$ is mapped to 
  processor $\pi_l$.¹⁰ 

  We will use index $i$ for the source of precedences and index $k$ for its allocated 
  processor. For the sink, we use $j$ and $l$ accordingly. 

• For a mapping to processors $\pi_k$ and $\pi_l$, $h_{i,j,k,l}$ represents the communication cost 
  from task $\tau_i$ to task $\tau_j$: 

  $h_{i,j,k,l} = \kappa_k + \frac{data_{i,j}}{b_{k,l}}$ if $k \neq l$ (6.41) 
  $h_{i,j,k,l} = 0$ if $k = l$ (6.42) 

• The average communication cost is defined as 

  $\bar{h}_{i,j} = \bar{\kappa} + \frac{data_{i,j}}{\bar{B}}$ (6.43) 

  where $\bar{\kappa}$ is the average communication startup time and $\bar{B}$ is the average 
  communication bandwidth. 

• Given a partial schedule, $s_e(\tau_i, \pi_k)$ is the earliest start time for task $\tau_i$ on 
  processor $\pi_k$. Obviously, $s_e(\tau_{entry}, \pi_k)$ is zero, for any $k$. 


¹⁰ Indexes $k$ and $l$ are not explicit in the original paper. 




• We define $f_e(\tau_i, \pi_k)$ as the earliest finishing time for task $\tau_i$ on heterogeneous 
  processor $\pi_k$. $f_e(\tau_{entry}, \pi_k)$ is equal to $c_{entry,k}$. 

• Once the decision to schedule task $\tau_i$ on processor $\pi_k$ has been taken, the actual 
  start time $s(\tau_i, \pi_k)$ and the actual finish time $f(\tau_i, \pi_k)$ can be computed. 
  $s_e(\tau_j, \pi_l)$ and $f_e(\tau_j, \pi_l)$ can be computed from a partial schedule iteratively 
  as follows: 

  $s_e(\tau_j, \pi_l) = \max\{avail(l), \max_{\tau_i \in pred(\tau_j)} (f(\tau_i) + h_{i,j,k,l})\}$ (6.44) 
  $f_e(\tau_j, \pi_l) = c_{j,l} + s_e(\tau_j, \pi_l)$ (6.45) 

  where $pred(\tau_j)$ is the set of immediate predecessor tasks of task $\tau_j$, $k$ is the 
  processor task $\tau_i$ is mapped to in the partial schedule, and $avail(l)$ is the time 
  that processor $\pi_l$ completed the execution of its last task. The max expression in 
  the inner term is the time when all data needed by $\tau_j$ has arrived at processor $\pi_l$. 

• For HEFT and CPOP, we assume that the makespan is to be minimized. The 
  makespan is computed from the actual finish time of the exit node: 

  $makespan = f(\tau_{exit})$ (6.46) 


• The average execution time $\bar{c}_i$ is the execution time $c_{i,k}$, averaged over all $\pi_k$. 

• The upward rank $rank_u(\tau_i)$ of a task $\tau_i$ is the length of the critical path from 
  the exit node up to and including node $\tau_i$: 

  $rank_u(\tau_{exit}) = \bar{c}_{exit}$ (6.47) 
  $rank_u(\tau_i) = \bar{c}_i + \max_{\tau_j \in succ(\tau_i)} (\bar{h}_{i,j} + rank_u(\tau_j))$ (6.48) 

  $succ(\tau_i)$ is the set of successors of task $\tau_i$ in the task graph. 

• The downward rank $rank_d(\tau_j)$ of a task $\tau_j$ is the length of the critical path from 
  the start node up to and excluding node $\tau_j$: 

  $rank_d(\tau_{entry}) = 0$ (6.49) 
  $rank_d(\tau_j) = \max_{\tau_i \in pred(\tau_j)} (rank_d(\tau_i) + \bar{c}_i + \bar{h}_{i,j})$ (6.50) 

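The recursive definition of the upward rank in Eqs. (6.47) and (6.48) can be sketched in C. This is only a minimal illustration under our own assumptions: the four-task graph, the cost values, and the names NTASKS, NO_EDGE, avg_c, avg_h, and rank_u are hypothetical and do not stem from the original paper. Task 0 plays the role of the entry node and task 3 that of the exit node.

```c
#include <assert.h>

#define NTASKS  4        /* hypothetical size of the task graph */
#define NO_EDGE (-1.0)   /* marker: no precedence between two tasks */

/* Hypothetical average execution times (c-bar_i in the text). */
static const double avg_c[NTASKS] = {2.0, 3.0, 4.0, 1.0};

/* avg_h[i][j]: hypothetical average communication cost (h-bar_{i,j})
   of the edge from task i to task j, or NO_EDGE. */
static const double avg_h[NTASKS][NTASKS] = {
    {NO_EDGE, 1.0,     2.0,     NO_EDGE},
    {NO_EDGE, NO_EDGE, NO_EDGE, 1.0},
    {NO_EDGE, NO_EDGE, NO_EDGE, 3.0},
    {NO_EDGE, NO_EDGE, NO_EDGE, NO_EDGE},
};

/* Upward rank according to Eqs. (6.47) and (6.48). For the exit node
   there is no successor, so the maximum is empty and the result is
   just the average execution time. */
double rank_u(int i)
{
    assert(i >= 0 && i < NTASKS);
    double best = 0.0;
    for (int j = 0; j < NTASKS; j++) {
        if (avg_h[i][j] >= 0.0) {                 /* j is a successor of i */
            double r = avg_h[i][j] + rank_u(j);
            if (r > best)
                best = r;
        }
    }
    return avg_c[i] + best;
}
```

For this graph, rank_u(0) evaluates to 12, the length of the path through tasks 0, 2, and 3 including communication. The plain recursion revisits shared successors; for larger graphs, ranks would rather be computed once per task in reverse topological order.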
The HEFT algorithm is shown below: 


Set the computation and communication costs to mean values; 
Compute $rank_u(\tau_i)$ $\forall \tau_i$ (upward traversal starting at $\tau_{exit}$); 
Sort tasks in nonincreasing order of $rank_u$ values; 
while there are unscheduled tasks in the list do { 
  select the first task $\tau_i$ in the list for scheduling; 
  for each processor $\pi_k \in \pi$ { 
    compute $f_e(\tau_i, \pi_k)$ using an insertion-based scheduling policy; 
  } 
  assign task $\tau_i$ to processor $\pi_k$ minimizing $f_e(\tau_i, \pi_k)$; 
} 


In this context, “insertion-based policy” means that the algorithm searches for 
a sufficiently large gap among already scheduled tasks such that an allocation into 
this gap would respect precedence constraints. 
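
The gap search of an insertion-based policy can be sketched as follows. The sketch is ours and makes simplifying assumptions: the busy slots of one processor are given as an array sorted by start time, and the parameter ready stands for the time at which all input data of the task has arrived at this processor (the inner max term of Eq. (6.44)). The names slot and earliest_start are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One already scheduled task on a processor: busy in [start, finish). */
struct slot {
    double start;
    double finish;
};

/* Earliest start time for a task of length dur that is ready at time
   ready, on a processor whose busy slots are sorted by start time and
   do not overlap. A gap between two slots is used if it is large
   enough; otherwise the task is appended after the last slot. */
double earliest_start(const struct slot *busy, size_t n,
                      double ready, double dur)
{
    assert(dur >= 0.0);
    double t = ready;                     /* candidate start time */
    for (size_t i = 0; i < n; i++) {
        if (t + dur <= busy[i].start)     /* fits into the gap before slot i */
            return t;
        if (busy[i].finish > t)           /* otherwise move past slot i */
            t = busy[i].finish;
    }
    return t;
}
```

Adding the execution time $c_{j,l}$ to the returned value yields the earliest finish time used by HEFT for selecting a processor.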


Example 6.17 Suppose that execution times are given by the table in Fig. 6.31 (left). 
Note that for each task, the execution times in Fig. 6.26 (right) have been selected 
as the minimum time among the three processors. Figure 6.31 (center) shows the 
schedule obtained by HEFT for the DAG shown in Fig. 6.26 (left). Precedences 
have been correctly taken into account. We cannot expect to generate the same 
short schedule as for ASAP or ALAP scheduling as these policies ignore resource 
constraints. ∇ 


Fig. 6.31 Left, execution times; center, results for HEFT; right, results for CPOP 


The CPOP algorithm focuses on the critical path in the DAG and uses different 
task priorities and different processor allocation strategies. The CPOP algorithm 
works as follows: 


Set the computation and communication costs to mean values; 
Compute $\forall i$: $rank_u(\tau_i)$ and $rank_d(\tau_i)$; 
Compute $\forall i$: $priority(\tau_i) = rank_d(\tau_i) + rank_u(\tau_i)$; 
$|CP| = priority(\tau_{entry})$; /* length of the critical path */ 
$SET_{CP} = \{\tau_{entry}\}$, where $SET_{CP}$ is the set of tasks on the critical path; 
$\tau_i = \tau_{entry}$; 
while $\tau_i$ is not the exit task { 
  Select $\tau_j \in succ(\tau_i)$, where $priority(\tau_j) == |CP|$; 
  $SET_{CP} = SET_{CP} \cup \{\tau_j\}$; 
  $\tau_i = \tau_j$; 
}; 
Select processor $\pi_{CP}$ minimizing execution time on the critical path; 
Initialize the priority queue with the entry task; 
while there is an unscheduled task in the priority queue { 
  Select the highest priority task $\tau_i$ from the priority queue; 
  if $\tau_i \in SET_{CP}$ {assign task $\tau_i$ to $\pi_{CP}$} 
  else {assign task $\tau_i$ to the processor which minimizes $f_e(\tau_i, \pi_k)$}; 
  Update priority queue with successors of $\tau_i$ if they become ready; 
} 
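
The first while loop of CPOP, which collects the tasks on the critical path, can be sketched in C as shown below. The sketch assumes that the priorities have already been computed; the graph size NTASKS, the tolerance EPS, and the function name mark_critical_path are our own hypothetical choices. Since priorities are sums of averaged values, the comparison with |CP| uses a small tolerance instead of exact equality.

```c
#include <assert.h>

#define NTASKS 4       /* hypothetical size of the task graph */
#define EPS    1e-9    /* tolerance for comparing priorities */

/* Walks from the entry task to the exit task, always following a
   successor whose priority equals |CP| = prio[entry]. edge[i][j] != 0
   means that task j is an immediate successor of task i.
   Sets on_cp[i] to 1 for tasks on the critical path and returns their
   number, or -1 if no successor with matching priority exists. */
int mark_critical_path(const double prio[NTASKS],
                       int edge[NTASKS][NTASKS],
                       int entry, int exit_task, int on_cp[NTASKS])
{
    assert(entry >= 0 && entry < NTASKS);
    assert(exit_task >= 0 && exit_task < NTASKS);
    const double cp = prio[entry];        /* length of the critical path */
    int count = 1;
    for (int i = 0; i < NTASKS; i++)
        on_cp[i] = 0;
    int i = entry;
    on_cp[i] = 1;
    while (i != exit_task) {
        int next = -1;
        for (int j = 0; j < NTASKS && next < 0; j++) {
            double diff = prio[j] - cp;
            if (diff < 0.0)
                diff = -diff;
            if (edge[i][j] && diff < EPS)
                next = j;
        }
        if (next < 0)
            return -1;                    /* inconsistent priorities */
        on_cp[next] = 1;
        count++;
        i = next;
    }
    return count;
}
```

Tasks marked in on_cp would then all be assigned to the single processor $\pi_{CP}$, while all other tasks are placed as in HEFT.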


Example 6.18 Figure 6.31 (right) shows the scheduling result for algorithm CPOP. ∇ 


The HEFT and CPOP algorithms are fast and relatively simple. Obviously, these 
algorithms make use of several approximations (e.g., average communication costs) 
and heuristics. They were selected for this book to demonstrate some key issues 
of scheduling algorithms for heterogeneous processors. 
However, it is possible to improve over the results of these two algorithms. 

For example, Kim et al. [294] present more complex algorithms generating better 
results. A mapping for KPNs aiming at makespan minimization has been published 
by Castrillon et al. [86]. 


6.5.3 Static Scheduling with Integer Linear Programming 


Integer linear programming can also be applied to the case of heterogeneous 
processors. One approach has been published by Maculan et al. [361]. Most 
importantly, processor-dependent execution times are taken care of. However, the 
presented equations require some refinement before they can be fed into an ILP 
solver and applications have not been included. Also, it is possible to adapt 
techniques published in the context of high-level synthesis [44, 314]. 

In most of the publications, optimizations aim at optimizing a single objective. 
In general, more objectives should be considered. For example, Fard et al. [162] 
present an algorithm taking four different objectives into account. 


6.5.4 Static Scheduling with Evolutionary Algorithms 


Integer-programming-based approaches potentially suffer from long execution 
times. In many cases, the use of evolutionary algorithms allows a better optimization 
while still keeping execution times reasonably short. We will demonstrate this by 
means of the distributed operation layer (DOL) tools from ETH Zürich [537]. These 
tools incorporate 


• Automatic selection of computation templates: Processor types can be 
  completely heterogeneous. Standard processors, micro-controllers, DSP processors, 
  FPGAs, etc. are all possible options. 

• Automatic selection of communication techniques: Various interconnection 
  schemes like central buses, hierarchical buses, rings, etc. are feasible. 

• Automatic selection of scheduling and arbitration: DOL design space 
  exploration tools automatically choose between rate monotonic scheduling, EDF, and 
  TDMA- and priority-based schemes. 


The input to DOL consists of a set of tasks together with use cases. The output 
describes the execution platform, the mapping of tasks to processors together with 
task schedules. This output is expected to meet constraints (like memory size and 
timing constraints) and to minimize objectives (like size, energy, etc.). Applications 
are represented by the so-called problem graphs, which in essence are special 
task graphs. Figure 6.32 shows a simple DOL problem graph. This graph models 
computations (see nodes 1, 2, 3, 4) and communication (see nodes 5, 6, 7). 

In addition, possible execution platforms are represented by the so-called 
architecture graphs. Figure 6.33 shows a simple hardware platform together with 
its architecture graph. Again, communication is modeled explicitly. 

The problem graph and the architecture graph are connected in the specification 
graph. Figure 6.34 shows a DOL specification graph. Specification graphs consist of 
a problem graph and an architecture graph. Edges between the two subgraphs rep- 
resent feasible implementations. For example, computation 1 can be implemented 
only on the RISC processor and computation 3 on the RISC processor or on HWM1. 
Communication 5 can be implemented on the shared bus or locally on the processor 
if computations 1 and 3 are both mapped to the processor. 


Fig. 6.32 DOL problem graph 

Fig. 6.33 DOL architecture graph 


Implementations are represented by a triple: 


• An allocation A: A is a subset of the architecture graph, representing hardware 
  components allocated (selected) for a particular design. 

• A binding b: A selected subset of the edges between specification and 
  architecture identifies a relation between the two. Selected edges are called bindings. 

• A schedule S: S assigns start times to each node $\tau_i$ in the problem graph. 


Example 6.19 Figure 6.35 shows how the specification of Fig. 6.34 can be turned 
into an implementation. HWM2 and the PTP bus are not used and not included in the 
set A. A subset b of the edges has been selected for mapping. Nodes 1, 2, 3, and 5 
have indeed all been mapped to the RISC processor, turning communication 5 into 
local communication. Node 4 is mapped to HWM1 and communicates via the shared 
bus. Schedule S specifies that computation 1 starts at time 0, communication 5 and 
computation 2 start at time 1, computation 3 and communication 6 start at time 21, 
communication 7 starts at time 29, and finally computation 4 starts at time 30. ∇ 


In DOL, implementations are generated with evolutionary algorithms. With such 
algorithms, solutions are represented as strings in chromosomes of “individuals” 
[31, 32, 107]. Using evolutionary algorithms, new sets of solutions can be derived 
from existing sets of solutions. The derivation is based on evolutionary operators 
such as mutation, selection, and recombination. The selection of new sets of 
solutions is based on fitness values. Evolutionary algorithms are capable of solving 
complex optimization problems not tractable by other types of algorithms. Finding 
appropriate ways of encoding solutions in chromosomes is not easy. On the one 
hand, the decoding should not require too much run-time. On the other hand, 
we must deal with the situation after the evolutionary transformations. These 
transformations could generate infeasible solutions, except for some carefully 
designed encodings. 

In DOL, chromosomes encode allocations and bindings. In order to evaluate the 
fitness of a certain solution, allocations and bindings must be decoded from the 
individuals (see Fig. 6.36). In DOL, schedules are not encoded in the chromosomes. 
Rather, they are derived from the allocation and binding. This way, overloading 
evolutionary algorithms with scheduling decisions is avoided. Once the schedule 
has been computed, the fitness of solutions can be evaluated. 

The overall architecture of DOL is shown in Fig. 6.37. 


Fig. 6.37 DOL tool 

Fig. 6.38 Pareto front of solutions for a design problem, ©ETHZ 


Initially, the task graph, use cases, and available resources are defined. This can 
be done with a specialized editor called MOSES. This initial information is evaluated 
in the evaluation framework EXPO. Performance values computed by EXPO are then 
sent to SPEA2, an evolutionary algorithm-based optimization framework. SPEA2 
selects good candidate architectures. These are sent back to EXPO for an evaluation. 
Evaluation results are then communicated again to SPEA2 for another round of 
evolutionary optimizations. This kind of ping-pong game between EXPO and SPEA2 
continues until good solutions have been found. The selection of solutions is based 
on the principle of Pareto optimality. A set of Pareto optimal designs is returned to 
the designer, who can then analyze the trade-off between the different objectives. 
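
The principle of Pareto optimality used for the selection can be illustrated with a small C sketch. It is not the SPEA2 algorithm itself, which also includes fitness assignment and archiving; it merely filters a given set of designs. We assume that all objectives are to be minimized; the names NOBJ, dominates, and pareto_filter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NOBJ 2   /* hypothetical number of objectives, all minimized */

/* Returns 1 if design a dominates design b: a is not worse in any
   objective and strictly better in at least one. */
int dominates(const double a[NOBJ], const double b[NOBJ])
{
    int strictly_better = 0;
    for (size_t k = 0; k < NOBJ; k++) {
        if (a[k] > b[k])
            return 0;
        if (a[k] < b[k])
            strictly_better = 1;
    }
    return strictly_better;
}

/* Sets keep[i] to 1 for every design that is not dominated by any
   other design, and to 0 otherwise. Returns the number of designs on
   the Pareto front. */
size_t pareto_filter(double designs[][NOBJ], size_t n, int keep[])
{
    assert(designs != NULL && keep != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        keep[i] = 1;
        for (size_t j = 0; j < n; j++) {
            if (j != i && dominates(designs[j], designs[i])) {
                keep[i] = 0;
                break;
            }
        }
        count += (size_t)keep[i];
    }
    return count;
}
```

With objective vectors such as (cost, latency), a design like (3, 3) is discarded as soon as a design (2, 2) exists, whereas (1, 5) and (5, 1) both remain as incomparable trade-offs.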


Example 6.20 Figure 6.38 shows the resulting visualization of the Pareto front. 
Trade-offs between the performance for two applications and the savings in cost 
can be seen. ∇ 


Holzkamp designed a variant of DOL which focuses on memory optimiza- 
tions [220]. Evolutionary algorithms have become a standard technique for more 
advanced scheduling problems, beyond the problems solved by HEFT or CPOP. 

The functionality of the SystemCodesigner [285] is somewhat similar to that 
of DOL. However, it differs in the way specifications are described (they can be 
represented in SystemC) and in the way the optimizations are performed. The 
mapping of applications is modeled as an ILP model. A first solution is generated 
using an ILP optimizer. This solution is then improved by switching to evolutionary 
algorithms.¹¹ 

Daedalus [422] incorporates automatic parallelization. For this purpose, sequen- 
tial applications are mapped to Kahn process networks. Design space exploration is 
then performed using Kahn process networks as an intermediate representation. 

Other approaches start from a given task graph and map to a fixed architecture. 
For example, Ruggiero maps applications to cell processors [475]. The HOPES 
system is able to map to various processors [195], using models of computation 
supported by the Ptolemy tools. Some tools take additional objectives into account. 
For example, Xu considers the optimization of the dependable lifetime of the 
resulting system [605]. Simunic incorporates thermal analysis into her work and 
tries to avoid overly hot areas on the MPSoC [492]. Further work includes that 
of Popovici et al. [457]. This work uses several levels of modeling, employing 
Simulink and SystemC as languages. 

Auto-parallelizing approaches for fixed architectures include work at the Univer- 
sity of Edinburgh [168]. MAPS tools [88] combine automatic parallelization with 
a limited DSE. Cordes [110] worked on the automatic parallelization for multi- 
cores, using high-level cost models. Neugebauer et al. [417] designed an approach 
to parallelization and used it for the optimization of an innovative sensor for bio- 
viruses. The combination of sensing and information processing demonstrates the 
value of cyber-physical systems. 


6.5.5 Dynamic and Hybrid Scheduling 


For dynamic scheduling, processor allocation is performed at run-time rather than 
at design time. Dynamic scheduling has a number of advantages [493]: 


• Adaptability to the available resources: Dynamic scheduling is able to take 
  changing resource availabilities like energy, memory space, and communication 
  bandwidth into account. 

• Ability to enable unforeseeable upgrades: Changing application requirements 
  are easier to integrate when scheduling is dynamic. 

• Resilience to defects: Defective resources like failed processors can be taken 
  into account by dynamic scheduling. 

• Use of non-real-time platforms: Dynamic scheduling is the standard for 
  non-real-time computing. Hence, techniques for non-real-time computing can be 
  applied, which helps to reduce development efforts. 


However, there are also disadvantages: 


• Lacking real-time guarantees: In a fully dynamically scheduled system, it is 
  difficult if not impossible to give real-time guarantees. 

• Run-time overhead: Dynamic scheduling requires run-time for taking 
  scheduling decisions. Therefore, complex scheduling techniques must be avoided. 

• Limited knowledge: At run-time, there is typically limited knowledge 
  concerning the task system and its parameters. 


¹¹ A more recent version uses a satisfiability (SAT) solver for the same purpose. 


There are two approaches for dynamic scheduling: on-the-fly mapping and hybrid 
mapping using previously analyzed (DSE) results. 

Singh et al. [493] provide an overview of 25 different approaches for on-the-fly 
mapping. This type of mapping is closest to mapping in non-real-time systems. 

Hybrid mapping techniques using previously analyzed (DSE) results try to 
avoid some of the disadvantages listed above by making results from design time 
analysis available at run-time. For example, we could pre-compute schedules for 
likely run-time scenarios and then select at run-time the schedule for the current 
scenario. Singh et al. distinguish between multiple mappings pre-computed for a 
single application, multiple mappings pre-computed for multiple applications, 
and reliability-aware analysis.¹² The authors provide an overview of 21 different 
approaches for performing design-time analysis and run-time mapping in a 
sequence. 

One could go one step further by integrating scheduling with the application. For 
example, Kotthaus [307] has designed an approach to mathematical optimization. 
In this approach, the number of evaluations of an objective function is not fixed, but 
depends also on the progress of parallel function evaluations on a multi-core system. 
Similar integration would also be possible for other applications. 


6.6 Problems 


We suggest solving the following problems either at home or during a flipped 
classroom session: 


6.1 Suppose that we have a set of four jobs. Release times $r_i$, deadlines $D_i$, and 
execution times $C_i$ are as follows: 


• $J_1$: $r_1=10$, $D_1=18$, $C_1=4$ 
• $J_2$: $r_2=0$, $D_2=28$, $C_2=12$ 
• $J_3$: $r_3=6$, $D_3=17$, $C_3=3$ 
• $J_4$: $r_4=3$, $D_4=13$, $C_4=6$ 


Generate a graphical representation of schedules for this job set, using earliest 
deadline first (EDF) and least laxity (LL) scheduling algorithms! For LL scheduling, 
indicate laxities for all jobs at all context switch times. Will any job miss its 
deadline? 


¹² We merge Singh’s hybrid mappings with these three classes. 


Fig. 6.39 Precedences 


6.2 Suppose that we have a task set of six tasks $\tau_1$ to $\tau_6$. Their execution times and 
their deadlines are as follows: 


• $\tau_1$: $D_1=15$, $C_1=3$ 
• $\tau_2$: $D_2=13$, $C_2=5$ 
• $\tau_3$: $D_3=14$, $C_3=4$ 
• $\tau_4$: $D_4=16$, $C_4=2$ 
• $\tau_5$: $D_5=20$, $C_5=4$ 
• $\tau_6$: $D_6=22$, $C_6=3$ 


Precedences are as shown in Fig. 6.39. Tasks $\tau_1$ and $\tau_2$ are available immediately. 
Generate a graphical representation of schedules for this task set, using the latest 
deadline first (LDF) algorithm! 


6.3 Suppose that we have a system comprising two tasks. Task 1 has a period of 
5 and an execution time of 2. The second task has a period of 7 and an execution 
time of 4. Let the deadlines be equal to the periods. Assume that we are using rate 
monotonic scheduling (RMS). Could any of the two tasks miss its deadline, due to a 
too high processor utilization? Compute this utilization, and compare it to a bound 
which would guarantee schedulability! Generate a graphical representation of the 
resulting schedule! Suppose that tasks will always run to their completion, even if 
they missed their deadline. 


6.4 Consider the same task set as in the previous assignment. Use earliest deadline 
first (EDF) for scheduling. Can any of the tasks miss its deadline? If not, why not? 
Generate a graphical representation of the resulting schedule! Suppose that tasks 
will always run to their completion. 


Chapter 7 
Optimization 


Embedded systems have to be efficient (at least) with respect to the objectives 
considered in this book. In particular, this applies to resource-constrained mobile 
systems, including sensor networks embedded in the Internet of Things. In order 
to achieve this goal, many optimizations have been developed. Only a small subset 
of those can be mentioned in this book. In this chapter, we will present a selected 
set of such optimizations. This chapter is structured as follows: first of all, we will 
present some high-level optimization techniques, which could precede compilation 
of source code or could be integrated into it. We will then describe concurrency 
management for tasks. Section 7.3 comprises advanced compilation techniques. The 
final Sect. 7.4 introduces power and thermal management techniques. 

As indicated in our design flow, these optimizations complement the tools 
mapping applications to the final systems, as described in Chap. 6 and as shown 
in Fig. 7.1. Mapping tools may be optimizing, and optimization techniques may 
involve scheduling. Hence, the scopes of the current chapter and of Chap. 6 partially 
overlap. The focus of Chap. 6 is on fundamental knowledge for mapping to 
platforms, while the current chapter deals mostly with improvements over basic 
techniques and has more of the character of an elective. 


7.1 High-Level Optimizations 


In this section, we will be considering optimizations which can be applied to the 
source code of embedded software, before compilation or during early compilation 
phases. Detecting regular structures such as array access patterns may be easier at 
the source code level than at the machine code level. Also, optimization effects can 
usually be expressed by rewriting the source program, i.e., the modified code can 
be expressed in the source language. This helps in understanding the effect of such 
transformations. We also consider cases in which it may be necessary to annotate 
the source code with compiler directives and hints. Such code transformations are 
called high-level optimizations. They have the potential to improve the efficiency 
of embedded software. 


7.1.1 Simple Loop Transformations 


There are a number of loop transformations that can be applied to source code. The 
following is a list of standard loop transformations: 


• Loop permutation: Consider a two-dimensional array. According to the C 
  standard [289], two-dimensional arrays are laid out in memory as shown in 
  Fig. 7.2. Adjacent index values of the second index are mapped to a contiguous 
  block of locations in memory. This layout is called row-major order [405]. 

  For row-major layout, it is usually beneficial to organize loops such that the 
  last index corresponds to the innermost loop. Note that the layout for arrays 
  is different for Fortran: adjacent values of the first index are mapped to a 
  contiguous block of locations in memory (column-major order). Switching 
  between publications describing optimizations for Fortran and for C can therefore 
  be confusing. 


  Example 7.1 The following is a loop permutation: 


  for (k=0; k<m; k++)          for (j=0; j<n; j++) 
   for (j=0; j<n; j++)    ⇔     for (k=0; k<m; k++) 


  Such permutations may have a positive effect on the reuse of array elements in 
  the cache, since the next iteration of the innermost loop will access an adjacent 
  location in memory. ∇ 


Caches are normally organized such that adjacent locations can be accessed 
significantly faster than locations that are further away from the previously 
accessed location. In this way, caches are exploiting spatial locality. 


Definition 7.1 Consider memory references to memory addresses a and b. 
Suppose that we assume an access to a. We observe spatial locality if—under 
this condition—the probability of also accessing b increases for small differences 
of addresses a and b. 


• Loop unrolling: Loop unrolling is a standard transformation creating several 
  instances of the loop body. 


  Example 7.2 In this example, we unroll the loop: 


  for (j=0; j<n; j++)          for (j=0; j<n; j+=2) 
   p[j]= ...;             ⇒     {p[j]= ...; 
                                 p[j+1]= ...} 

  In this particular case, the loop is unrolled once. ∇ 


  The number of copies of the loop body is called the unrolling factor. Unrolling 
  factors larger than two are possible. Unrolling reduces the loop overhead (fewer 
  branches per execution of the original loop body) and therefore typically improves 
  the speed. As an extreme case, loops can be completely unrolled, removing control 
  overhead and branches altogether. Unrolling typically enables a number of 
  subsequent transformations and may therefore be beneficial even in cases where 
  just unrolling the program does not give any advantages. However, unrolling 
  increases code size. Unrolling is normally restricted to loops with a constant 
  number of iterations. 

• Loop fusion, loop fission: There may be cases in which two separate loops can 
  be merged, and there may be cases in which a single loop is split into two. 


  Example 7.3 Consider the two versions of the following code: 


  for (j=0; j<n; j++)          for (j=0; j<n; j++) 
   p[j]= ...;                   {p[j]= ...; 
  for (j=0; j<n; j++)     ⇔      p[j]= p[j]+ ...} 
   p[j]= p[j]+ ... 


  The left version may be advantageous if the target processor provides a 
  zero-overhead loop instruction which can only be used for small loops. Also, the left 
  version may provide good candidates for unrolling, due to the simple loops. The 
  right version might lead to an improved cache behavior (due to the improved 
  locality of references to array p) and also increases the potential for parallel 
  computations within the loop body. As with many other transformations, it is 
  difficult to know which of the transformations leads to the best code. ∇ 
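
As a complement to the loop transformations listed above, the following C sketch shows unrolling by a factor of two for a loop whose trip count is not known to be even. It is a hand-written illustration under our own assumptions; the function names scale and scale_unrolled are hypothetical. The extra statement after the loop handles the remaining iteration for odd n, which is one reason why unrolling is simplest for loops with a constant number of iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element per iteration. */
void scale(int *p, size_t n, int f)
{
    assert(p != NULL || n == 0);
    for (size_t j = 0; j < n; j++)
        p[j] *= f;
}

/* Unrolled by a factor of two: two copies of the loop body per
   iteration, hence about half as many loop branches. */
void scale_unrolled(int *p, size_t n, int f)
{
    assert(p != NULL || n == 0);
    size_t j = 0;
    for (; j + 1 < n; j += 2) {
        p[j]     *= f;
        p[j + 1] *= f;
    }
    if (j < n)            /* remaining iteration if n is odd */
        p[j] *= f;
}
```

Both functions compute the same result for any n. Whether the unrolled version is actually faster depends on the compiler and the target processor, since many compilers apply this transformation themselves.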


7.1.2 Loop Tiling/Blocking 


Since small memories are faster than large memories (see p. 170), the use of 
memory hierarchies may be beneficial. Possible “small” memories include caches 
and scratchpad memories. A significant reuse factor for the information in those 
memories is required. Otherwise the memory hierarchy cannot be exploited. 


Example 7.4 Reuse effects can be demonstrated by an analysis of the following 
example. Let us consider matrix multiplication for arrays of size N × N: 


for (i=0; i<N; i++) 
 for (j=0; j<N; j++) { 
  r=0; 
  for (k=0; k<N; k++) 
   r+=X[i][k]*Y[k][j]; 
  Z[i][j]=r; 
 } 


Scalar variable r represents Z[i][j] in all iterations of the innermost loop. This 
is supposed to help the compiler to allocate this element temporarily to a register. 

Let us consider access patterns for this code, as shown in Fig. 7.3. We assume 
that array elements are allocated in row-major order (as it is standard for C). 
This means that array elements with adjacent values of the rightmost index 
are stored in adjacent memory locations. Accordingly, adjacent locations of X are 
fetched during the iterations of the innermost loop. This property is beneficial if the 
memory system uses prefetching (whenever a word is loaded into the cache, loading 
of the next word is started as well). Accesses to Y do not exhibit spatial locality. If 
the cache is not large enough to hold a full cache row, every access to Y will be a 
cache miss. Hence, there will be N³ references to elements of Y in main memory. 




Fig. 7.3 Access pattern for unblocked matrix multiplication 


Research on scientific computing led to the design of blocked or tiled 
algorithms [320, 606], which improve the locality of references. The following is a 
tiled version of the above algorithm¹ for a block size parameter B: 


for (ii=0; ii<N; ii+=B) 
 for (jj=0; jj<N; jj+=B) 
  for (kk=0; kk<N; kk+=B) 
   for (i=ii; i<min(ii+B,N); i++) 
    for (j=jj; j<min(jj+B,N); j++) { 
     r=0; 
     for (k=kk; k<min(kk+B,N); k++) 
      r+= X[i][k]*Y[k][j]; 
     Z[i][j]+=r; /* partial sum of block kk; Z must be initialized to 0 */ 
    } 


Now, the two innermost loops are constrained to traverse a block of size B² for 
array Y. Suppose that a block of size B² fits into the cache. Then, the first execution of 
the innermost loop will load this block into the cache. During the second execution 
of the innermost loop, these elements will be reused. Overall, there will be B-1 
reuses of elements of Y. Hence, the number of accesses to main memory for elements 
of this array will be reduced to N³/(B-1). ∇ 
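
A compilable variant of the tiled multiplication is sketched below, together with the untiled version for comparison. The sizes MAT_N and BLK correspond to N and B in the text and are hypothetical values chosen such that the block size does not divide the matrix size; the function names are our own. In this sketch, Z is cleared first because each block along k contributes only a partial sum.

```c
#include <assert.h>

#define MAT_N 6   /* hypothetical matrix size (N in the text) */
#define BLK   4   /* hypothetical block size (B in the text) */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Untiled reference version. */
void matmul_plain(double X[MAT_N][MAT_N], double Y[MAT_N][MAT_N],
                  double Z[MAT_N][MAT_N])
{
    for (int i = 0; i < MAT_N; i++)
        for (int j = 0; j < MAT_N; j++) {
            double r = 0.0;
            for (int k = 0; k < MAT_N; k++)
                r += X[i][k] * Y[k][j];
            Z[i][j] = r;
        }
}

/* Tiled version: the three inner loops work on BLK x BLK blocks. */
void matmul_tiled(double X[MAT_N][MAT_N], double Y[MAT_N][MAT_N],
                  double Z[MAT_N][MAT_N])
{
    assert(BLK > 0);
    for (int i = 0; i < MAT_N; i++)
        for (int j = 0; j < MAT_N; j++)
            Z[i][j] = 0.0;                      /* partial sums are added below */
    for (int ii = 0; ii < MAT_N; ii += BLK)
        for (int jj = 0; jj < MAT_N; jj += BLK)
            for (int kk = 0; kk < MAT_N; kk += BLK)
                for (int i = ii; i < min_int(ii + BLK, MAT_N); i++)
                    for (int j = jj; j < min_int(jj + BLK, MAT_N); j++) {
                        double r = 0.0;
                        for (int k = kk; k < min_int(kk + BLK, MAT_N); k++)
                            r += X[i][k] * Y[k][j];
                        Z[i][j] += r;
                    }
}
```

Both functions produce the same matrix up to floating-point rounding, since tiling changes only the order of the additions. A suitable block size depends on the cache of the target platform and would have to be determined experimentally.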


Optimizing the reuse factor has been an area of comprehensive research. Initial 
research focused on the performance improvements that can be obtained by tiling. 
Performance improvements for matrix multiplication by a factor between 3 and 4.3 
were reported by Lam [320]. Improvements increase with an increasing gap between 
processor and memory speeds. Tiling can also reduce the energy consumption of 
memory systems [103]. 


¹ This code was adopted from http://www.netlib.org/utk/papers/autoblock/node2.html. 


Fig. 7.4 Splitting image processing into regular and special cases 


7.1.3 Loop Splitting 


Next, we discuss loop splitting as another optimization that can be applied before 
compilation. Potentially, this optimization could also be added to compilers. 

Many image processing algorithms perform some kind of filtering. This filtering 
consists of considering the information about a certain pixel as well as that of some 
of its neighbors. Corresponding computations are typically quite regular. However, 
if the considered pixel is close to the boundary of the image, not all neighboring 
pixels exist, and the computations must be modified. In a straightforward description 
of the filtering algorithm, these modifications may result in tests being performed in 
the innermost loop of the algorithm. A more efficient version of the algorithm can 
be generated by splitting the loops such that one loop body handles the regular 
cases and a second loop body handles the exceptions. Figure 7.4 is a graphical 
representation of this transformation. Margin checking is required for the yellow 
areas. 

Performing this loop splitting manually is very difficult and error-prone. Falk 
et al. have published an algorithm [159] which also works for larger dimensions 
automatically. It is based on a sophisticated analysis of accesses to array elements 
in loops using polyhedral analysis [586]. Optimized solutions are generated using 
genetic algorithms from the PGAPack library [340]. Falk’s algorithm can be 
implemented, e.g., as a compiler pre-pass tool. 
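
The idea of splitting can be illustrated by a small hand-written C sketch. It is not Falk's algorithm, but a manual split for a simple filter that sums up the 3×3 neighborhood of each pixel and ignores neighbors outside the image. The image size IMG_W × IMG_H and all function names are hypothetical.

```c
#include <assert.h>

#define IMG_W 8   /* hypothetical image width */
#define IMG_H 6   /* hypothetical image height */

/* Sum of the 3x3 neighborhood with margin checks for every neighbor. */
static int sum_checked(int img[IMG_H][IMG_W], int x, int y)
{
    int s = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int yy = y + dy, xx = x + dx;
            if (yy >= 0 && yy < IMG_H && xx >= 0 && xx < IMG_W)
                s += img[yy][xx];
        }
    return s;
}

/* Straightforward version: tests are executed in the innermost loop. */
void filter_naive(int in[IMG_H][IMG_W], int out[IMG_H][IMG_W])
{
    for (int y = 0; y < IMG_H; y++)
        for (int x = 0; x < IMG_W; x++)
            out[y][x] = sum_checked(in, x, y);
}

/* Split version: regular cases without tests, special cases with tests. */
void filter_split(int in[IMG_H][IMG_W], int out[IMG_H][IMG_W])
{
    assert(IMG_W >= 2 && IMG_H >= 2);
    /* Regular cases: interior pixels, all nine neighbors exist. */
    for (int y = 1; y < IMG_H - 1; y++)
        for (int x = 1; x < IMG_W - 1; x++) {
            int s = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    s += in[y + dy][x + dx];
            out[y][x] = s;
        }
    /* Special cases: top and bottom rows, including the corners. */
    for (int x = 0; x < IMG_W; x++) {
        out[0][x] = sum_checked(in, x, 0);
        out[IMG_H - 1][x] = sum_checked(in, x, IMG_H - 1);
    }
    /* Special cases: left and right columns without the corners. */
    for (int y = 1; y < IMG_H - 1; y++) {
        out[y][0] = sum_checked(in, 0, y);
        out[y][IMG_W - 1] = sum_checked(in, IMG_W - 1, y);
    }
}
```

Both versions compute the same output image. In the split version, the margin tests are executed only for the border pixels, whose number grows with the perimeter of the image rather than with its area.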


Example 7.5 The following code shows a loop nest from the MPEG-4 standard 
performing motion estimation: 


for (z=0; z<20; z++) 
 for (x=0; x<36; x++) {x1=4*x; 
  for (y=0; y<49; y++) {y1=4*y; 
   for (k=0; k<9; k++) {x2=x1+k-4; 
    for (l=0; l<9; l++) {y2=y1+l-4; 
     for (i=0; i<4; i++) {x3=x1+i; x4=x2+i; 
      for (j=0; j<4; j++) {y3=y1+j; y4=y2+j; 
       if (x3<0 || 35<x3 || y3<0 || 48<y3) 
        then_block_1; else else_block_1; 
       if (x4<0 || 35<x4 || y4<0 || 48<y4) 
        then_block_2; else else_block_2; 
 }}}}}} 


Falk’s algorithm detects that the conditions x3<0 and y3<0 are never true. The 
analysis allows transforming the loop nest into the code below. Instead of complex 
tests in the innermost loop, we have a splitting if-statement after the third for-loop 
statement. Regular cases are handled in the then part of this statement. The else 
part handles the relatively small number of remaining cases: 


for (z=0; z<20; z++) 
 for (x=0; x<36; x++) {x1=4*x; 
  for (y=0; y<49; y++) 
   if (x>=10 || y>=14) 
    for (; y<49; y++) 
     for (k=0; k<9; k++) 
      for (l=0; l<9; l++) 
       for (i=0; i<4; i++) 
        for (j=0; j<4; j++) { 
         then_block_1; then_block_2} 
   else {y1=4*y; 
    for (k=0; k<9; k++) {x2=x1+k-4; 
     for (l=0; l<9; l++) {y2=y1+l-4; 
      for (i=0; i<4; i++) {x3=x1+i; x4=x2+i; 
       for (j=0; j<4; j++) {y3=y1+j; y4=y2+j; 
        if (0 || 35<x3 || 0 || 48<y3) /* x3<0, y3<0 never true */ 
         then_block_1; else else_block_1; 
        if (x4<0 || 35<x4 || y4<0 || 48<y4) 
         then_block_2; else else_block_2; 
 }}}}}} 


∇ 


Run-times can be reduced by loop splitting for various applications and architec- 
tures. Resulting relative run-times are shown in Fig. 7.5. For the motion estimation 
algorithm, cycle counts can be reduced by up to about 75% (to 25% of the original 
value). Substantial savings (larger than for the simple transformations mentioned 
earlier) are possible. 


7.1.4 Array Folding 


Some embedded applications, especially in the multimedia domain, include large 
arrays. Since memory space in embedded systems is limited, options for reducing 
the storage requirements of arrays should be explored. Figure 7.6 represents the 
addresses used by five arrays as a function of time. At any particular time, only a 
subset of array elements is needed. The maximum number of elements needed is 
called the address reference window [122]. In Fig. 7.6, this maximum is indicated 
by a double-headed arrow. A classical memory allocation for arrays is shown in 
Fig. 7.7 (left). Each array is allocated the maximum of the space it requires during
the entire execution time (if we consider global arrays). 

One of the possible improvements, inter-array folding, is shown in Fig. 7.7 
(center). Arrays which are not needed at overlapping time intervals can share the 
same memory space. A second improvement, intra-array folding [121], is shown 
in Fig. 7.7 (right). It takes advantage of the limited sets of components needed 
within an array. Storage can be saved at the expense of more complex address
computations. The two kinds of foldings can also be combined. 
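
The trade-off between storage and address computations can be illustrated with a small sketch. It assumes a hypothetical recurrence in which element t depends only on elements t-1 and t-2, so that at most three elements are needed at any time; the names recurrence_unfolded, recurrence_folded, N, and WINDOW are chosen for this illustration only.

```c
enum { N = 32, WINDOW = 3 };   /* hypothetical array length and window size */

/* Unfolded version: one storage location per array element. */
unsigned recurrence_unfolded(void) {
    unsigned a[N];
    a[0] = 1;
    a[1] = 1;
    for (int t = 2; t < N; t++)
        a[t] = a[t - 1] + a[t - 2];
    return a[N - 1];
}

/* Intra-array folding: element t is stored at index t % WINDOW, so only
   WINDOW locations are allocated. The price is the modulo operation in
   every address computation. */
unsigned recurrence_folded(void) {
    unsigned a[WINDOW];
    a[0 % WINDOW] = 1;
    a[1 % WINDOW] = 1;
    for (int t = 2; t < N; t++)
        a[t % WINDOW] = a[(t - 1) % WINDOW] + a[(t - 2) % WINDOW];
    return a[(N - 1) % WINDOW];
}
```

In this sketch, the folded version needs 3 instead of 32 array elements. The folding is valid only because no element older than t-2 is ever read again.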

Other forms of high-level transformations have been analyzed by Chung, Benini, 
and De Micheli [103, 524]. There are many additional contributions in this domain 
in the compiler community. 


7.1.5 Floating-Point to Fixed-Point Conversion 


Floating-point to fixed-point conversion is a commonly used optimization tech- 
nique. This conversion is motivated by the fact that many signal processing 
standards (such as MPEG-2 or MPEG-4) are specified in the form of C-programs 
using floating-point data types. It is left to the designer to find an efficient 
implementation of these standards. 

For many signal processing applications, it is possible to replace floating-point 
numbers with fixed-point numbers (see p. 153). The benefits may be significant. For 
example, a reduction of the cycle count by 75% and of the energy consumption 
by 76% has been reported for an MPEG-2 video compression algorithm [225]. 
However, some loss of precision is normally incurred. More precisely, there is a 
trade-off between the cost of the implementation and the quality of the algorithm 
(evaluated, for example, in terms of quality metrics; see Sect. 5.3 on p. 254).
For small word lengths, the quality may be seriously affected. Consequently, the
quality loss has to be analyzed. The replacement of floating-point by fixed-point
numbers was initially performed manually. However, this is a very tedious and
error-prone process.
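
The kind of manual replacement involved can be sketched as follows. The sketch assumes the Q15 format (16-bit words with 15 fractional bits), which is one common choice and is not prescribed by the text; the type q15_t and the helper functions are hypothetical names. Saturation on overflow is deliberately omitted.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* assumed format: 1 sign bit, 15 fractional bits, range [-1, 1) */

/* Convert a float in [-1, 1) to Q15, rounding to the nearest representable value. */
q15_t q15_from_float(float x) {
    return (q15_t)(x * 32768.0f + (x >= 0.0f ? 0.5f : -0.5f));
}

/* Convert a Q15 value back to float, e.g., for measuring the quality loss. */
float q15_to_float(q15_t q) {
    return (float)q / 32768.0f;
}

/* Fixed-point multiplication: the 32-bit product has 30 fractional bits and is
   rounded back to 15 fractional bits. This assumes an arithmetic right shift
   for negative values, which is implementation-defined in C but provided by
   common compilers. The case (-1) * (-1) overflows and is not handled here. */
q15_t q15_mul(q15_t a, q15_t b) {
    int32_t p = (int32_t)a * (int32_t)b;
    return (q15_t)((p + (1 << 14)) >> 15);
}
```

A floating-point multiplication x*y is then replaced by q15_mul(q15_from_float(x), q15_from_float(y)), and the deviation from the floating-point result is the precision loss that has to be analyzed.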

Therefore, researchers have tried to support this replacement with tools. One
such tool is FRIDGE (fixed-point programming design environment) [283, 588].
The functionality of FRIDGE has been made available commercially as part of the 
Synopsys System Studio tool suite [518]. 

SystemC can be used for simulating fixed-point data types. 

An analysis of the trade-offs between the additional noise introduced and the 
word length needed was proposed by Shi and Brodersen [486] and also by Menard 
et al. [390]. The topic continues to attract researchers [334], also in the context of 
machine learning [454]. 


7.2 Task-Level Concurrency Management 


As mentioned on p. 38, the task graphs’ granularity is one of their most important 
properties. Even for hierarchical task graphs, it may be useful to change the 
granularity of the nodes. The partitioning of specifications into tasks or processes 
does not necessarily aim at the maximum implementation efficiency. Rather, during 
the specification phase, a clear separation of concerns and a clean software model 
are more important than caring about the implementation too much. For example, 
a clear separation of concerns includes a clear separation of the implementation 
of abstract data types from their use. As a result of the design process, tasks will
typically become objects within the operating system, i.e., processes (cf. Definition
4.1) or threads. Also, we might be using several tasks in a pipelined fashion
in our specification, while merging some of them might reduce context switching
overhead. Hence, there will not necessarily be a one-to-one correspondence between
the tasks in the specification and those in the implementation. This means that
a regrouping of tasks may be advisable. Such a regrouping is indeed feasible by
merging and splitting of tasks.


Fig. 7.9 Splitting of tasks

Merging of task graphs can be performed whenever some task τ_i is the immediate
predecessor of some other task τ_j and if τ_j does not have any other immediate
predecessor (see Fig. 7.8 with τ_i = τ_3 and τ_j = τ_4). This transformation can lead to
a reduced overhead of context switches if the node is implemented in software, and 
it can lead to a larger potential for optimizations in general. 

On the other hand, splitting of tasks may be advantageous, since tasks may be 
holding resources (like large amounts of memory) while they are waiting for some 
input. In order to maximize the use of these resources, it may be best to constrain the 
use of these resources to the time intervals during which these resources are actually 
needed. 


Example 7.6 In Fig. 7.9, we are assuming that task τ_2 requires some input somewhere
in its code.

In the initial version, the execution of task τ_2 can only start if this input is
available. We can split the node into τ_2* and τ_2** such that the input is only required
for the execution of τ_2**. Now, τ_2* can start earlier, resulting in more scheduling
freedom. This improved scheduling freedom might improve resource utilization
and could even enable meeting some deadline. It may also have an impact on the
memory required for data storage, since τ_2* could release some of its memory before
terminating and this memory could be used by other tasks while τ_2** is waiting for
input. ∇


One might argue that the tasks should release resources like large amounts 
of memory before waiting for input. However, the readability of the original 
specification could suffer from caring about implementation issues in an early 
design phase. 

Quite complex transformations of the specifications can be performed with a
Petri net-based technique described by Cortadella et al. [111]. Their technique starts 
with a specification consisting of a set of tasks described in a language called FlowC. 
FlowC extends C with process headers and inter-task communication specified in 
the form of read and write function calls. 


Example 7.7 Figure 7.10 shows an input specification using FlowC.


PROCESS GetData
 (InPort IN, OutPort DATA){
 float sample, sum; int i;
 while(1) {
  sum=0;
  for (i=0; i<N; i++){
   READ(IN, sample, 1); sum+=sample;
   WRITE(DATA, sample, 1);
  }
  WRITE(DATA, sum/N, 1);
 }
}

PROCESS Filter(InPort DATA,
 InPort COEF, OutPort OUT){
 float c, d; int j;
 c=1; j=0;
 while(1) {
  SELECT(DATA, COEF) {
   case DATA: READ(DATA,d,1);
    if (j==N) {j=0; d=d*c; WRITE(OUT,d,1);
    }
    else j++; break;
   case COEF: READ(COEF,c,1); break;
  }
 }
}


Fig. 7.10 System specification


The example uses input ports IN and COEF, as well as an output port OUT. Point-to-point
interprocess communication between processes is realized through a unidirectional
buffered channel DATA. Task GetData reads data from the environment and sends 
it to channel DATA. Each time N samples have been sent, their average value is 
also sent via the same channel. Task Filter reads N values from the channel (and 
ignores them), then reads the average value, and multiplies the average value by 
c. (c can be read from port COEF). Filter writes the result to port OUT. The third 
parameter in READ and WRITE calls is the number of items to be read or written. 
READ calls are blocking, and WRITE calls are blocking if the number of items in 
the channel exceeds a predefined threshold. The SELECT statement has the same 
semantics as the statement with the same name in Ada (see p. 112): execution 
of this task is suspended until input arrives from one of the ports. This example 
meets all criteria for splitting tasks that were mentioned in the context of Fig. 7.9. 
Both tasks will be waiting for input while occupying resources. Efficiency could be 
improved by restructuring these tasks. However, the simple splitting of Fig. 7.9 is 
not sufficient. The technique proposed by Cortadella et al. is more comprehensive: 
FlowC programs are first translated into (extended) Petri nets. Petri nets for each of 
the tasks are then merged into a single Petri net. Using results from Petri net theory, 
new tasks are then generated. Figure 7.11 shows a possible new task structure. 

Init(){
 sum=0; i=0; c=1; j=0;
}

tau_in(){
 READ(IN, sample, 1);
 sum+=sample; i++;
 DATA=sample; d=DATA;
 if (j==N) {
  j=0; d=d*c; WRITE(OUT,d,1);
 }
 else j++;
 L0: if (i<N) return;
 DATA=sum/N; d=DATA;
 if (j==N) {
  j=0; d=d*c; WRITE(OUT,d,1);
 }
 else j++;
 sum=0; i=0; goto L0;
}

tau_coef(){
 READ(COEF,c,1);
}


Fig. 7.11 Generated software tasks


In this new task structure, there is one task which performs all initializations. In
addition, there is one task for each of the input ports. An efficient implementation
would raise interrupts each time new input is received for a port. There should be a
unique interrupt per port. The tasks could then be started directly by those interrupts,
and there would be no need to invoke the operating system for that. Communication 
can be implemented as a single shared global variable (assuming a shared address 
space). The operating system overhead would be small, if required at all. 

The code for task tau_in shown in Fig. 7.11 is the one that is generated by the 
Petri net-based inter-task optimization of the task structure. It should be further 
optimized by intra-task optimizations, since the test performed for the first if 
statement is always false (j is equal to i-1 in this case, and i and j are reset to
0 whenever i becomes equal to N). For the third if statement, the test is always
true, since this point of control is only reached if i is equal to N and i is equal to
j whenever label L0 is reached. Also, the number of variables can be reduced. The
following is an optimized version of tau_in [111]: 


tau_in () {
 READ(IN, sample, 1);
 sum+=sample; i++;
 DATA=sample; d=DATA; /* merging of DATA & d feasible */
 L0: if (i<N) return;
 DATA=sum/N; d=DATA;
 d=d*c; WRITE(OUT,d,1);
 sum=0; i=0;
 return;
}


The optimized version of tau_in could be generated by a clever compiler. Hardly 
any of today’s compilers will generate this version, but the example shows the type 
of transformations required for generating “good” task structures. ∇
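
The claimed equivalence of the generated and the optimized version of tau_in can be checked with a small simulation in plain C. The sketch below is only a test harness under stated assumptions: the FlowC calls READ and WRITE are replaced by the hypothetical helpers read_in and write_out operating on arrays, N is set to 4, and the function run_task is invented for this illustration.

```c
#include <assert.h>

enum { N = 4, MAX_OUT = 16 };        /* illustrative values, not from the source */

/* Hypothetical stand-ins for the ports IN and OUT. */
static const float *in_stream;
static int in_pos;
static float out_buf[MAX_OUT];
static int out_len;

static float read_in(void) { return in_stream[in_pos++]; }
static void write_out(float v) { assert(out_len < MAX_OUT); out_buf[out_len++] = v; }

/* State kept between task invocations, initialized as in Init() of Fig. 7.11. */
static float sum, c, d, DATA;
static int i, j;

static void init(const float *samples) {
    in_stream = samples; in_pos = 0; out_len = 0;
    sum = 0; i = 0; c = 1; j = 0;
}

/* tau_in as generated by the inter-task optimization (Fig. 7.11). */
void tau_in_generated(void) {
    float sample = read_in();
    sum += sample; i++;
    DATA = sample; d = DATA;
    if (j == N) { j = 0; d = d * c; write_out(d); }   /* never taken */
    else j++;
L0: if (i < N) return;
    DATA = sum / N; d = DATA;
    if (j == N) { j = 0; d = d * c; write_out(d); }   /* always taken */
    else j++;
    sum = 0; i = 0; goto L0;
}

/* tau_in after intra-task optimization. */
void tau_in_optimized(void) {
    float sample = read_in();
    sum += sample; i++;
    DATA = sample; d = DATA;
    if (i < N) return;
    DATA = sum / N; d = DATA;
    d = d * c; write_out(d);
    sum = 0; i = 0;
}

/* Invoke a task once per sample and copy the values written to OUT. */
int run_task(void (*task)(void), const float *samples, int n, float *out) {
    init(samples);
    for (int k = 0; k < n; k++) task();
    for (int k = 0; k < out_len; k++) out[k] = out_buf[k];
    return out_len;
}
```

For any input sequence, both versions should write the same averages to OUT, one value for every N samples read.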


For more details about the task generation, refer to Cortadella et al. [111]. Similar 
optimizations are described in the book by Thoen [538] and in a publication by 
Meijer et al. [389]. 


7.3 Compilers for Embedded Systems 


7.3.1 Introduction 


Obviously, optimizations and compilers are available for the processors used in PCs 
and servers. Compiler generation for commonly used processors is well understood. 
For embedded systems, standard compilers are also used in many cases, since they 
are typically cheap or even freely available. 

However, there are several reasons for designing special optimizations and 
compilers for embedded systems: 


• Processor architectures in embedded systems exhibit special features (see p. 143).
These features should be exploited by compilers in order to generate efficient
code. Compilation techniques might also have to support compression techniques
described on p. 148-p. 150.

• A high efficiency of the code is more important than a high compilation speed.

• Compilers could potentially help to meet and prove real-time constraints. First of
all, it would be nice if compilers contained explicit timing models. These could 
be used for optimizations which really improve the timing behavior. For example, 
it may be beneficial to freeze certain cache lines in order to prevent frequently 
executed code from being evicted and reloaded several times. 

• Compilers may help to reduce the energy consumption of embedded systems.
Compilers performing energy optimizations should be available.

• For embedded systems, there is a larger variety of instruction sets. Hence, there
are more processors for which compilers should be available. Sometimes, there is 
even the request to support the optimization of instruction sets with retargetable 
compilers. For such compilers, the instruction set can be specified as an input 
to a compiler generation system. Such systems can be used for experimentally 
modifying instruction sets and then observing the resulting changes for the 
generated machine code. This is one particular case of design space exploration 
and is supported, for example, by Tensilica tools [82]. 


Some approaches for retargetable compilers are described in a book on this topic 
[376]. Optimizations can be found in books by Leupers et al. [337, 338]. In 
this section, we will present examples of compilation techniques for embedded
processors. 


7.3.2 Energy-Aware Compilation 


Many embedded systems are mobile systems which must run on batteries. While 
computational demands on mobile systems are increasing, battery technology is 
expected to improve only slowly [414]. Hence, the availability of energy is a serious 
bottleneck for new applications. 

Saving energy can be done at various levels, including the fabrication process 
technology, the device technology, the circuit design, the operating system, and the 
application algorithms. Adequate translation from algorithms to machine code can 
also help. High-level optimization techniques such as those presented on p. 349-p. 
357 can also help to reduce the energy consumption. In this subsection, we will look 
at compiler optimizations which can reduce the energy consumption (frequently 
called low-power optimizations). Energy models are essential ingredients of
all energy optimizations. Energy models were presented in Chap. 5. Using models
like those, the following compiler optimizations have been used for reducing the 
energy consumption: 


• Energy-aware scheduling: the order of instructions can be changed as long as
the meaning of the program does not change. The order can be changed such that
the number of transitions on the instruction bus is minimized. This optimization
can be performed on the output generated by a compiler and therefore does not
require any change to the compiler.

• Energy-aware instruction selection: typically, there are different instruction
sequences for implementing the same source code. In a standard compiler, the
number of instructions or the number of cycles is used as a criterion (cost
function) for selecting a good sequence. This criterion can be replaced by the
energy consumed by that sequence. Steinke and others found that energy-aware
instruction selection reduces the energy consumption by a few percent [509].

• Replacing the cost function is also possible for other standard compiler optimizations,
such as register pipelining, loop invariant code motion, etc. Possible
improvements are also in the order of a few percent.

• Exploitation of the memory hierarchy: as already explained on p. 168,
smaller memories provide faster access and consume less energy per access. 
Therefore, a significant amount of energy can be saved if memory hierarchies 
are exploited. Of all the compiler optimizations analyzed by Steinke [511, 512], 
the energy savings enabled by memory hierarchies are the largest. It is therefore 
beneficial to use small scratchpad memories (SPMs; see p. 172) in addition to 
large background memories. All accesses to the corresponding address range 
will then require less energy and are faster than accesses to the larger memory. 
The compiler should be responsible for allocating variables and instructions to 
the scratchpad. This approach does, however, require that frequently accessed 
variables and code sequences are identified and mapped to that address range. 
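
The cost function behind energy-aware scheduling in the first item of this list can be sketched as follows. The sketch counts the bit transitions between consecutive instruction words on a 32-bit instruction bus; the function names are hypothetical, and any instruction encodings passed to it are assumed to be given.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bits set in v. */
static unsigned popcount32(uint32_t v) {
    unsigned count = 0;
    while (v) {
        v &= v - 1;      /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Total number of bit transitions on the instruction bus when the n
   instruction words in code[] are fetched in the given order. */
unsigned bus_transitions(const uint32_t *code, size_t n) {
    unsigned total = 0;
    for (size_t k = 1; k < n; k++)
        total += popcount32(code[k - 1] ^ code[k]);
    return total;
}
```

A scheduler could evaluate this function for each legal instruction order, i.e., each order respecting the data dependences, and select the order with the smallest value.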


7.3.3 Memory-Architecture Aware Compilation


Compilation Techniques for Scratchpads


The advantages of using SPMs have been clearly demonstrated [36]. Therefore, 
exploiting SPMs is the most prominent case of memory hierarchy exploitation. 
Available compilers are usually capable of mapping memory objects to certain 
address ranges in the memory. Toward this end, the source code typically has to 
be annotated. 


Example 7.8 For ARM® tools, memory segments can be introduced in the source 
code by using pragmas like 


#pragma arm section rwdata = "foo", rodata = "bar"


Variables declared after this pragma would be mapped to read-write segment
"foo", and constants would be mapped to read-only segment "bar". Linker
commands can then map these segments to particular address ranges, including
those belonging to the SPM. ∇


This is the approach taken in compilers for ARM processors [20]. This is not a
very comfortable approach, and it would be nice if compilers could perform such 
a mapping automatically for frequently accessed objects. Therefore, optimization 
algorithms have been designed. Some of these optimizations have been presented 
in a separate book [378]. Available SPM optimizations can be classified into two 
categories: 


• Non-overlaying (or “static”) memory allocation strategies: for these strategies,
memory objects will stay in the SPM while the corresponding application is
executed.

• Overlaying (or “dynamic”) memory allocation strategies: for these strategies,
memory objects are moved in and out of the SPM at run-time. This is a kind of 
“compiler-controlled paging,” except the migration of objects happens between 
the SPM and some slower memory and does not involve any disks. 


Non-overlaying Allocation 


For non-overlaying allocation, we can start by considering the allocation of 
functions and global variables to the SPM. For this purpose, each function and each 
global variable can be modeled as a memory object. Let 


• S be the size of the SPM,

• sf_i and sv_i be the sizes of function i and variable i, respectively,

• g be the energy consumption saved per access to the SPM (i.e., the difference
between the energy required per access to the slow main memory and the one
required per access to the SPM),

• nf_i and nv_i be the number of accesses to function i and variable i, respectively,

• xf_i and xv_i be defined as


\[
xf_i = \begin{cases} 1 & \text{if function } i \text{ is mapped to the SPM} \\ 0 & \text{otherwise} \end{cases} \tag{7.1}
\]

\[
xv_i = \begin{cases} 1 & \text{if variable } i \text{ is mapped to the SPM} \\ 0 & \text{otherwise} \end{cases} \tag{7.2}
\]


Then, the goal is to maximize the gain 


\[
G = g \left( \sum_i nf_i \cdot xf_i + \sum_i nv_i \cdot xv_i \right) \tag{7.3}
\]


while respecting the size constraint


\[
\sum_i sf_i \cdot xf_i + \sum_i sv_i \cdot xv_i \le S \tag{7.4}
\]


The problem is known as a (simple) knapsack problem (see p. 320 for the more
general case). Standard knapsack algorithms can be used for selecting the objects 
to be allocated to the SPM. However, Eqs. (7.3) and (7.4) also have the form of an 
integer linear programming (ILP) problem (see Appendix A), and ILP solvers can be 
used as well. g is a constant factor in the objective function and is not needed for the 
solution of the ILP problem. The corresponding optimization can be implemented 
as a pre-pass optimization (see Fig. 7.12). 

The optimization impacts addresses of functions and global variables. Compilers 
typically allow a manual specification of these addresses in the source code. 
Hence, no change to the compiler itself is required. The advantage of such a pre- 
pass optimization is that it can be used with compilers for many different target 
processors. There is no need to modify a large number of target-specific compilers. 
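
The selection of objects according to Eqs. (7.3) and (7.4) can be sketched with a standard dynamic programming algorithm for the knapsack problem. In this sketch, functions and variables are treated uniformly as memory objects, the constant factor g is omitted, and sizes are assumed to be positive integers; the function name spm_knapsack and the limits MAX_OBJECTS and MAX_SPM are hypothetical.

```c
#include <assert.h>

enum { MAX_OBJECTS = 16, MAX_SPM = 1024 };   /* illustrative limits */

/* Selects the memory objects to be mapped to an SPM of size S.
   size[i]: size of object i, n[i]: number of accesses to object i,
   x[i] (output): 1 if object i is mapped to the SPM, 0 otherwise.
   Returns the maximum of sum_i n[i]*x[i] under sum_i size[i]*x[i] <= S. */
long spm_knapsack(int count, const int size[], const long n[], int S, int x[]) {
    /* best[k][cap]: maximum gain using only the first k objects and cap bytes */
    static long best[MAX_OBJECTS + 1][MAX_SPM + 1];
    assert(count >= 0 && count <= MAX_OBJECTS && S >= 0 && S <= MAX_SPM);

    for (int cap = 0; cap <= S; cap++)
        best[0][cap] = 0;
    for (int k = 1; k <= count; k++)
        for (int cap = 0; cap <= S; cap++) {
            best[k][cap] = best[k - 1][cap];              /* object k-1 not in SPM */
            if (size[k - 1] <= cap) {
                long with = best[k - 1][cap - size[k - 1]] + n[k - 1];
                if (with > best[k][cap])
                    best[k][cap] = with;                  /* object k-1 in SPM */
            }
        }

    /* Backtracking: recover the decision variables x[i]. */
    int cap = S;
    for (int k = count; k >= 1; k--) {
        if (best[k][cap] != best[k - 1][cap]) {
            x[k - 1] = 1;
            cap -= size[k - 1];
        } else {
            x[k - 1] = 0;
        }
    }
    return best[count][S];
}
```

A pre-pass tool could use the resulting values x[i] to generate the address annotations for the subsequent compiler run.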

The knapsack model can be extended in various directions:


• Allocation of basic blocks: The approach just described only allows the
allocation of entire functions or variables to the SPM. As a result, a major fraction 
of the SPM may remain empty if functions and variables are large. Therefore, 
we try to reduce the granularity of the objects which are allocated to the SPM. 
The natural choice is to consider basic blocks as memory objects. In addition, 
we do also consider sets of adjacent basic blocks, where adjacency is defined 
as being adjacent in the control flow graph [509]. We call such sets of adjacent 
blocks multi-basic blocks. Figure 7.13 shows a control flow graph and the set of 
considered multi-basic blocks. 


Fig. 7.12 Pre-pass optimization [blocks: source code; pre-pass optimizations; (ARM or gcc) compiler; target code; memory hierarchy description (e.g., SPM size)]


Fig. 7.13 Basic blocks and multi-basic blocks [multi-basic blocks: {BB1, BB2}, {BB1, BB3}, {BB1, BB2, BB3}]


The ILP model can be extended accordingly. Let


– sb_i and sm_i be the sizes of basic blocks i and multi-basic blocks i, respectively,

– nb_i and nm_i be the number of accesses to basic block i and multi-basic blocks
i, respectively,

– xb_i and xm_i be defined as


\[
xb_i = \begin{cases} 1 & \text{if basic block } i \text{ is mapped to the SPM} \\ 0 & \text{otherwise} \end{cases} \tag{7.5}
\]

\[
xm_i = \begin{cases} 1 & \text{if multi-basic block } i \text{ is mapped to the SPM} \\ 0 & \text{otherwise} \end{cases} \tag{7.6}
\]


Then, the goal is to maximize the gain


\[
G = g \left( \sum_i nf_i \cdot xf_i + \sum_i nb_i \cdot xb_i + \sum_i nm_i \cdot xm_i + \sum_i nv_i \cdot xv_i \right) \tag{7.7}
\]


while respecting the constraints


\[
\sum_i sf_i \cdot xf_i + \sum_i sb_i \cdot xb_i + \sum_i sm_i \cdot xm_i + \sum_i sv_i \cdot xv_i \le S \tag{7.8}
\]

\[
\forall \text{ basic blocks } i: \quad xb_i + xf_{fct(i)} + \sum_{i' \in multibasicblock(i)} xm_{i'} \le 1 \tag{7.9}
\]


Here, fct(i) is the function containing basic block i, and multibasicblock(i) is the set of
multi-basic blocks containing basic block i.

The constraint (7.9) ensures that a basic block is mapped to the SPM only 
once, instead of potentially being mapped as a member of the enclosing function 
and a member of a multi-basic block. 

Experiments using this model were performed by Steinke et al. [512]. For 
some benchmark applications, energy reductions of up to about 80% were found, 
even though the size of the SPM was just a small fraction of the total code size 
of the application. Results for the bubble sort program are shown in Fig. 7.14. 
Obviously, larger SPMs lead to a reduced energy consumption in the main 
memory (see white boxes). The energy required in the CPU is also reduced, since
fewer wait cycles are required. The SPM needs only small amounts of energy (see
the tiny blue boxes). Supply voltages have been assumed to be constant, even 
though a faster execution could have allowed us to scale down frequencies and 
voltages, leading to an even larger energy reduction. 

• Partitioned memories [572]: Small memories are faster and require less energy
per access. Therefore, it makes sense to partition memories into several smaller
memories. The ILP model can be extended easily to also model several memories.
We do not distinguish between the various types of memory objects
(functions, basic blocks, variables, etc.) in this case. An index i represents any
memory object. Let


– S_j be the size of memory j,

– s_i be the size of object i (as before),

– e_j be the energy consumption per access to memory j,

– n_i be the number of accesses to object i (as before),

– x_{j,i} be defined as


\[
x_{j,i} = \begin{cases} 1 & \text{if object } i \text{ is mapped to memory } j \\ 0 & \text{otherwise} \end{cases} \tag{7.10}
\]


Instead of maximizing the energy saving, we are now minimizing the overall
energy consumption. Hence, the goal is now to minimize


\[
C = \sum_j e_j \sum_i x_{j,i} \cdot n_i \tag{7.11}
\]

while respecting the constraints

\[
\forall j: \quad \sum_i s_i \cdot x_{j,i} \le S_j \tag{7.12}
\]

\[
\forall i: \quad \sum_j x_{j,i} = 1 \tag{7.13}
\]


Partitioned memories are advantageous especially for varying memory requirements.
Storage locations accessed frequently are called the working set of
an application. Applications with a small working set could use a very small 
fast memory, whereas applications requiring a larger working set could be 
allocated to a somewhat larger memory. Therefore, a key advantage of partitioned 
memories is their ability to adapt to the size of the current working set. 
Furthermore, unused memories can be shut down to save additional energy. 
However, we are considering only the “dynamic” energy consumption caused 
by accesses to the memory. In addition, there may be some energy consumption 
even if the memory is idle. This consumption is not considered here. Therefore, 
savings from shutting down memories are not reflected in Eqs. (7.11) and (7.12). 

• Link/load-time allocation of memory [420]: Optimizing code at compile time
for a certain SPM size has a disadvantage: the code might perform badly if we
run it on different variants of some processor if these variants have differently 
sized SPMs. We would like to avoid requiring different executable files for the 
different variants of the processor. As a result, we are interested in executables 
which are independent of the SPM size. This is feasible if we perform the 
optimization at link time. The proposed approach computes the ratio of the 
number of accesses divided by the size of a variable at compile time and stores 
this value together with other information about variables in the executable. At 
load time, the OS is queried for the size of the SPM. Then, the code is patched 
such that as many profitable variables as possible are allocated to the SPM. 
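
The load-time step of the last item can be sketched as follows. The sketch assumes that the executable contains, for each variable, its size and its profiled number of accesses, and it uses a simple greedy rule (highest ratio of accesses to size first) as the selection strategy. The struct var_info, the function allocate_at_load_time, and the greedy rule itself are assumptions made for this illustration and are not taken from [420].

```c
#include <stdlib.h>

/* Hypothetical per-variable record stored in the executable at compile time. */
struct var_info {
    const char *name;
    unsigned size;            /* size in bytes, assumed to be > 0 */
    unsigned long accesses;   /* profiled number of accesses */
    int in_spm;               /* set at load time */
};

/* Order by decreasing accesses/size; the ratios are compared by
   cross-multiplication in order to avoid divisions. */
static int by_ratio_desc(const void *a, const void *b) {
    const struct var_info *u = a, *v = b;
    unsigned long long l = (unsigned long long)u->accesses * v->size;
    unsigned long long r = (unsigned long long)v->accesses * u->size;
    return (l < r) - (l > r);
}

/* Called at load time, once the SPM size of the actual processor variant
   is known. Sorts vars[] by profitability, marks the variables placed in
   the SPM, and returns the number of SPM bytes used. */
unsigned allocate_at_load_time(struct var_info *vars, size_t count, unsigned spm_size) {
    unsigned used = 0;
    qsort(vars, count, sizeof vars[0], by_ratio_desc);
    for (size_t k = 0; k < count; k++) {
        vars[k].in_spm = (vars[k].size <= spm_size - used);
        if (vars[k].in_spm)
            used += vars[k].size;
    }
    return used;
}
```

The same executable can then be loaded on processor variants with differently sized SPMs; only the argument spm_size changes.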


Overlaying Allocation 


Large applications may have multiple hot spots (multiple areas of code containing 
compute-intensive loops). Non-overlaying approaches fail to provide the best 
possible results in this context. For such applications, the SPM should be exploited 
for each of the hot spots. This requires an automatic migration between the layers 
in the memory hierarchy. For overlaying algorithms, memory objects are migrated 
between different levels of the hierarchy. This migration can be either programmed
explicitly in the application or inserted automatically. Overlaying algorithms are
beneficial for applications with multiple hot spots whose code or data would
otherwise evict each other. For overlaying algorithms, we are typically assuming that
all applications are known at design time such that memory allocation can be
considered at this time. Algorithms by Verma [555] and by Udayakumararan et
al. [548] are early examples of such algorithms.

Verma’s algorithm starts with the CFG of the application to be optimized. For
edges of the graph, Verma considers potentially freeing the SPM for locally used
memory objects by storing these objects in some slower memory and later restoring
them. Blocks of code are handled as if they were arrays of data.


Fig. 7.15 Potential spill code


Footnote: Some of the material in this subsection has also been included in a separate book by the same
author and publisher [378].


Example 7.9 In Fig. 7.15, we are considering control blocks B1-B10 and control
flow branching at B2. We assume that array A is defined, modified, and used along
the left path. T3 is only used in the right part of the branch. We consider potentially
freeing the SPM so that T3 can be locally allocated to the SPM. This requires spill
and load operations in potentially inserted blocks B9 and B10 (dotted lines: potential
inserts). Cost and benefit of these spill operations are then incorporated into a global
ILP. Solving the ILP yields an optimal set of memory copy operations. ∇


For a set of benchmarks, the average reductions in energy consumption and execu- 
tion time, compared to the non-overlaying case, are 34% and 18%, respectively. 

Udayakumararan’s algorithm is similar, but it evaluates memory objects accord- 
ing to their number of memory accesses divided by their size. This metric is then 
used to heuristically guide the optimization process. This approach can also take 
heap objects into account. 

Large arrays are difficult to allocate to SPM. In fact, even a single array can be 
too large to fit into an SPM. The splitting strategy of Verma [160] is restricted to a 
single-array splitting. Loop tiling is a more general technique, which can be applied 
either manually or automatically [344]. Furthermore, array indexes can be analyzed 
in detail such that frequently accessed array components can be kept in the SPM 
[357]. 



Our explanations have so far mainly addressed code and global data. Stack and 
heap data require special attention. In both cases, two trivial solutions may be 
feasible: in some cases, we might prefer not to allocate stack or heap data to the
SPM at all. In other cases, we could run stack [5] and heap size analysis [219] to
check whether stack or heap fit completely into the SPM and, if they do, allocate 
them to the SPM. 

For the heap, Dominguez et al. [134] proposed to analyze the liveness of heap 
objects. Whenever some heap object is potentially needed, code is generated to 
ensure that the object will be in the SPM. Objects will always be at the same address, 
so that the problem of dangling references to heap objects in the SPM is avoided. 
McIlroy et al. [384] propose a dynamic memory allocator taking characteristics
of SPM into account. Bai et al. [33] suggest that the programmer should enclose 
accesses to global pointers by two functions p2s and s2p. These functions provide 
conversions between global and local (SPM) addresses and also ensure a proper 
copying of memory contents. 

For stack variables, Udayakumararan et al. [548] proposed using two stacks, one
for calls to short functions with their stack being in main memory and one for calls
to computationally expensive functions whose stack area is in the SPM. Kannan et
al. [281] suggested keeping the top stack frames in the SPM in a circular fashion.
During function calls, a check for a sufficient amount of space for the required stack 
frame is made. If the space is not available, old stack frames are copied to a reserved 
area in main memory. During returns from function calls, these frames can be copied 
back. Various optimizations aim at minimizing the necessary checks. 


Multiple Threads/Processes 


The above approaches are still limited to handling a single process or thread. For 
multiple threads, moving objects into and out of the SPM at context switch time has 
to be considered. Verma [556] proposed three different approaches: 


1. For the first approach, only a single process owns space in the SPM at any given 
time. At each context switch, the information of the preempted process in the 
occupied space is saved, and the information for the process to be executed is 
restored. This approach is called the saving/restoring approach. This approach 
does not work well with large SPMs, since the copying would consume a 
significant amount of time and energy. 

2. For the second approach, the space in the SPM is partitioned into areas for 
the various processes. The size of the partitions is determined in a special 
optimization. The SPM is filled during initialization. No further compiler- 
controlled copying is required. Therefore, this approach is called the non-saving 
approach. This approach makes sense only for SPMs large enough to contain 
areas for several processes. 

3. The third approach is a hybrid approach: The SPM is split into an area jointly 
used by processes and a second area, in which processes obtain some exclusively 
allocated space. The size of the two areas is determined in an optimization. 
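
The saving/restoring approach (approach 1) can be sketched in a few lines of C. This is only an illustration under simplifying assumptions and not the implementation from [556]: the SPM is modeled as an ordinary array, and the names `spm`, `SPM_SIZE`, `process_t`, and `spm_context_switch` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SPM_SIZE 64  /* hypothetical SPM size in bytes, kept tiny for illustration */

/* Stand-in for the scratchpad; on real hardware this would be a fixed address range. */
static unsigned char spm[SPM_SIZE];

/* Hypothetical per-process descriptor with a backup area in main memory. */
typedef struct {
    unsigned char backup[SPM_SIZE];
} process_t;

/* At a context switch, save the SPM contents of the preempted process
   and restore the contents belonging to the process executed next. */
void spm_context_switch(process_t *preempted, const process_t *next)
{
    memcpy(preempted->backup, spm, SPM_SIZE); /* save */
    memcpy(spm, next->backup, SPM_SIZE);      /* restore */
}
```

Both copies touch every byte of the SPM, so the cost of each context switch grows with the SPM size. This is why the approach does not work well with large SPMs.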


In more dynamic cases, the set of applications may vary during the use of the 
system. For such cases, dynamic memory managers are appropriate. Pyka [463] 
published an algorithm based on an SPM manager using indirect addressing and 
being included in the operating system. This approach also allows the migration 
of library elements to the SPM. A reduction of the consumed energy of 25%-35% 
could be achieved despite the additional level of indirect addressing. 

This additional level of indirection can be avoided if a memory management unit 
(see Appendix C) is available. Egger et al. [149] developed a technique exploiting 
MMUs: at compile time, sections of code are classified as either benefiting or 
not benefiting from an allocation to the SPM. The code benefiting is stored in a 
certain area in the virtual address space. Initially, this area is not mapped to physical 
memory. Therefore, a page fault occurs when the code is accessed for the very first 
time. Page fault handling then invokes the SPM manager (SPMM) and the SPMM 
allocates (and deallocates) space in the SPM, always updating the virtual-to-real 
address translation tables as needed. The approach is designed to handle code and 
is capable of supporting a dynamically changing set of applications. Unfortunately, 
the size of current SPMs corresponds to just a few entries in today’s page tables, 
resulting in a coarse-grained SPM allocation. 


Supporting Different Architectures and Objectives 


We have so far considered different allocation types. Another dimension in SPM 
allocation is the architectural dimension. Implicitly, we have so far considered 
single-core systems with a single-memory hierarchy layer and a single SPM. Other 
architectures exist as well. For example, there may be hybrid systems containing 
both caches and SPM. We can try to reduce cache misses by selectively allocating 
SPM space in case of cache conflicts [92, 280, 611]. Also, we can have different 
memory technologies, like flash memory or other types of nonvolatile RAM [565]. 
For flash memory, load balancing is important. Also, there might be multiple levels 
of memories. 

SPM can possibly be shared across cores. Also, there may be multiple memory 
hierarchy levels, some of which can be shared. Liu et al. [349] present an ILP-based 
approach for this. 

Still another dimension in SPM allocation is the objective function. So far, we 
have focused on energy or run-time minimization. Other objectives can be consid- 
ered as well. Implicitly, we have modeled the average case energy consumption. 
We could have modeled the worst case energy consumption (WCEC) instead. 
The WCEC is an objective considered, for example, by Liu [349]. Reliability and 
endurance are relevant for the design of reliable applications, in particular in the 
presence of aging [566]. It may also be necessary to avoid overheating of memories. 


7.3.4 Reconciling Compilers and Timing Analysis 


Almost all compilers which are available today do not include a timing model. 
Therefore, the development of real-time software typically has to follow an iterative 
approach: software is compiled by a compiler which is unaware of any timing 
information. The resulting code is then analyzed using a timing analyzer such as 
aiT [4]. If the timing constraints are not met, some of the inputs to the compiler 
run must be changed, and the procedure has to be repeated. We call this “trial- 
and-error”-based development of real-time software. This approach suffers from 
several problems. First of all, the number of required design iterations is initially 
unknown. Furthermore, the compiler used in this approach is “optimizing,” but 
a precise evaluation of objectives apart from the code size is usually impossible. 
Hence, compiler writers can only hope that their “optimizations” have a positive 
impact on the quality of the code in terms of relevant objectives. Due to the complex 
timing behavior of modern processors, this hope is hardly supported by evidence. 
Finally, the “trial-and-error”-based development of real-time software requires the 
designer to find appropriate modifications of the input to the compiler such that the 
real-time constraints will eventually be met. 

This “trial-and-error”-based approach can be avoided if timing analysis is 
integrated into the compiler. This has been the aim of the development of the 
worst case execution time-aware compiler (WCC). The development of WCC 
started at TU Dortmund with an integration of the timing analyzer aiT into an 
experimental compiler for the TriCore architecture. Figure 7.16 shows the resulting 
overall structure. WCC uses the ICD-C compiler infrastructure [230] to read and 
parse C source code. The source is then converted into a “high-level intermediate 
representation” (HL-IR). The HL-IR is an abstract representation of the source code. 
Various optimizations can be applied to the HL-IR. The optimized HL-IR is passed 
to the code selector. The code selector maps source code operations to machine 
instructions. Machine instructions are represented in the low-level intermediate 
representation (LLIR). In order to estimate the WCET_EST, the LLIR is converted 
into the CRL2 representation used by aiT (using the converter LLIR2CRL). aiT is 
then able to generate WCET_EST for the given machine code. This information 
is converted back into the LLIR representation (using the converter CRL2LLIR). 
WCC uses this information to consider WCET_EST as the objective function during 
optimizations. This is straightforward for optimizations at the LLIR 
level. However, many optimizations are performed at the HL-IR level. 
WCET_EST-directed optimizations at this level require using back annotation from the LLIR 
level to the HL-IR level. ICD-C includes this back annotation. 

WCC has been used to study the impact of optimizing for a reduced WCET_EST 
in the compiler. The numerous results include a study of the impact of this objective 
for register allocation [158]. Results shown in Fig. 7.17 indicate a dramatic impact. 
WCET_EST can be reduced to 68.8% of the original WCET_EST on average 
by just using WCET-aware register allocation in WCC. The largest reduction yields 
a WCET_EST of only 24.1% of the original WCET_EST. The combined effect of 
several such optimizations has been analyzed by Lokuciejewski et al. [353]. For 
the considered benchmarks, Lokuciejewski found a reduction down to 57.1% of 
the original WCET_EST. Lokuciejewski et al. have also used machine learning to 
optimize heuristics for WCET reduction [354]. 

Fig. 7.17 Reduction of WCET_EST by WCET-aware register allocation 


7.4 Power and Thermal Management 


7.4.1 Dynamic Voltage and Frequency Scaling (DVFS) 


Some embedded processors support dynamic power management (see p. 146) and 
dynamic voltage scaling (see p. 144). An additional optimization step can be used to 
exploit these features. Typically, such an optimization step follows code generation 
by the compiler. Optimizations at this step require a global view of all tasks of the 
system, including their dependencies, slack times, etc. 


Example 7.10 The potential of dynamic voltage scaling is demonstrated in the 
following example [251]. We assume that we have a processor which runs at three 
different voltages, 2.5 V, 4.0 V, and 5.0 V. Assuming an energy consumption of 40 nJ 
per cycle at 5.0 V, Eq. (3.14) can be used to compute the energy consumption at the 
other voltages (see Table 7.1, where 25 nJ is a rounded value). 

Furthermore, we assume that our task needs to execute 10^9 cycles within 25 s. 
There are several ways of doing this, as can be seen from Figs. 7.18, 7.19, and 7.20. 
Using the maximum voltage (see Fig. 7.18), it is possible to shut down the processor 
during the slack time of 5 s (we assume the power consumption to be zero during 
this time). 

Another option is to initially run the processor at full speed and then reduce 
the voltage when the remaining cycles can be completed at the lowest voltage (see 
Fig. 7.19). 

Finally, we can run the processor at a clock rate just large enough to complete 
the cycles within the available time (see Fig. 7.20). 

The corresponding energy consumptions can be calculated as 


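\begin{align}
E_a &= 10^9 \times 40 \times 10^{-9}\,\mathrm{J} = 40\,\mathrm{J} \tag{7.14} \\
E_b &= 750 \times 10^6 \times 40 \times 10^{-9}\,\mathrm{J} + 250 \times 10^6 \times 10 \times 10^{-9}\,\mathrm{J} = 32.5\,\mathrm{J} \tag{7.15} \\
E_c &= 10^9 \times 25 \times 10^{-9}\,\mathrm{J} = 25\,\mathrm{J} \tag{7.16}
\end{align}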


The smallest energy consumption is achieved for the ideal supply voltage of 4 V, 
with no idle time at the end. ∇ 
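
The arithmetic of this example can be checked with a short C sketch. The function names are our own and purely illustrative; the numbers are those of Table 7.1 and Eqs. (7.14) to (7.16).

```c
/* Energy in joules for executing `cycles` clock cycles at `nj_per_cycle` nJ each. */
double energy_joules(double cycles, double nj_per_cycle)
{
    return cycles * nj_per_cycle * 1e-9;
}

/* Execution time in seconds for `cycles` clock cycles at a clock of `f_mhz` MHz. */
double exec_time_s(double cycles, double f_mhz)
{
    return cycles / (f_mhz * 1e6);
}

/* The three voltage schedules of Example 7.10. */
double energy_schedule_a(void)   /* 10^9 cycles at 5.0 V */
{
    return energy_joules(1e9, 40.0);
}

double energy_schedule_b(void)   /* 750M cycles at 5.0 V, then 250M cycles at 2.5 V */
{
    return energy_joules(750e6, 40.0) + energy_joules(250e6, 10.0);
}

double energy_schedule_c(void)   /* 10^9 cycles at 4.0 V */
{
    return energy_joules(1e9, 25.0);
}
```

Calling `exec_time_s` for the individual phases confirms that the first schedule finishes after 20 s, whereas the second and the third schedule both use the full 25 s.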


Table 7.1 Characteristics of processor with DVFS 

V_dd [V]              | 5.0 | 4.0 | 2.5 
Energy per cycle [nJ] | 40  | 25  | 10 
f_max [MHz]           | 50  | 40  | 25 
Cycle time [ns]       | 20  | 25  | 40 

Fig. 7.18 Possible voltage schedule (10^9 cycles @ 50 MHz, 40 J; deadline marked on the time axis t [s]) 

Fig. 7.19 Second voltage schedule (750M cycles @ 50 MHz + 250M cycles @ 25 MHz) 

Fig. 7.20 Third voltage schedule (10^9 cycles @ 40 MHz, 25 J) 


In the following, we use the term variable voltage processor only for processors 
that allow any supply voltage up to a certain maximum. It is expensive to support 
truly variable voltages, and therefore, actual processors support only a few fixed 
voltages. 

The observations made for the above example can be generalized into the 
following statements. The proofs of these statements are given in the paper by 
Ishihara and Yasuura [251]. 


• If a variable voltage processor completes a task before the deadline, the energy 
consumption can be reduced.^3 

• If a processor uses a single supply voltage V_s and completes a task τ just at 
its deadline, then V_s is the unique supply voltage which minimizes the energy 
consumption of τ. 

• If a processor can only use a number of discrete voltage levels, then a voltage 
schedule using the two voltages which are the two immediate neighbors of the 
ideal voltage V_ideal can be chosen. These two voltages lead to the minimum energy 
consumption except if the need to use an integer number of cycles results in a small 
deviation from the minimum.^4 
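
For two given voltage levels, the number of cycles to be executed at each level follows from the requirement that the task completes just at its deadline. The following C sketch computes this split. The function name and its interface are our own assumptions, and the result is a real number, i.e., the rounding to an integer number of cycles is ignored.

```c
/* Split `cycles` between a faster and a slower clock such that the task
   finishes exactly at `deadline_s`. Frequencies are given in Hz, and
   f_hi_hz > f_lo_hz is assumed.
   Returns the number of cycles to run at f_hi_hz (the remaining cycles run
   at f_lo_hz), 0.0 if the slower clock alone meets the deadline, or -1.0 if
   the deadline cannot be met even at f_hi_hz. */
double cycles_at_high_freq(double cycles, double deadline_s,
                           double f_hi_hz, double f_lo_hz)
{
    if (cycles / f_hi_hz > deadline_s)
        return -1.0;                 /* infeasible */
    if (cycles / f_lo_hz <= deadline_s)
        return 0.0;                  /* slower clock alone suffices */
    /* Solve x / f_hi + (cycles - x) / f_lo = deadline for x. */
    return (deadline_s - cycles / f_lo_hz) / (1.0 / f_hi_hz - 1.0 / f_lo_hz);
}
```

If only the 5.0 V (50 MHz) and 2.5 V (25 MHz) levels of Table 7.1 were available, `cycles_at_high_freq(1e9, 25.0, 50e6, 25e6)` would yield about 750 × 10^6 cycles at 50 MHz, which corresponds to the schedule of Fig. 7.19.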

The statements can be used for allocating voltages to tasks. Next, we will 
consider such an allocation. We will use the following notation: 


n : the number of tasks 

EC_j : the number of executed cycles of task j 

L : the number of voltages of the target processor 

V_i : the i-th voltage, where 1 ≤ i ≤ L 

f_i : the clock frequency for supply voltage V_i 

d : the global deadline at which all tasks must have been completed 

SC_j : the average switching capacitance during the execution of task j (SC_j 
comprises the actual capacitance C_L and the switching activity α (see 
Eq. (3.14) on page 144)) 


The voltage scaling problem can then be formulated as an integer linear 
programming (ILP) problem (see p. 393). Toward this end, we introduce variables 
X_{i,j} denoting the number of cycles executed at a particular voltage: 


X_{i,j} : the number of clock cycles task j is executed at voltage V_i 


^3 This formulation makes an implicit assumption in lemma 1 of the paper by Ishihara and Yasuura 
explicit. 

^4 This need is not considered in the original paper. 


Simplifying assumptions of the ILP model include the following: 


• There is one processor that can be operated at a limited number of discrete 
voltages. 

• The time for voltage and frequency switches is negligible. 

• The worst case number of cycles for each task is known. 


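Using these assumptions, the ILP problem can be formulated as follows: 
Minimize 

\[
E = \sum_{j=1}^{n} \sum_{i=1}^{L} SC_j \cdot X_{i,j} \cdot V_i^2 \tag{7.17}
\]

subject to 

\[
\forall j: \sum_{i=1}^{L} X_{i,j} = EC_j \tag{7.18}
\]

and 

\[
\sum_{j=1}^{n} \sum_{i=1}^{L} \frac{X_{i,j}}{f_i} \le d \tag{7.19}
\]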


The goal is to find the number X_{i,j} of cycles that each task τ_j is executed at a 
certain voltage V_i. According to the statements made above, no task will ever need 
more than two voltages. Using this model, Ishihara and Yasuura show that efficiency 
is typically improved if tasks have a larger number of voltages to choose from. If 
large amounts of slack time are available, many voltage levels help to find close to 
optimal voltage levels. However, four voltage levels do already give good results 
quite frequently. 
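
To make the model more concrete, the following C sketch evaluates the objective (7.17) for a candidate assignment and checks the two constraints, i.e., that each task j executes exactly EC_j cycles (7.18) and that all tasks complete by the deadline d (7.19). It is not an ILP solver. The function names and the row-by-row array layout are our own assumptions.

```c
#include <stddef.h>

/* Objective (7.17): sum over all tasks j and voltage levels i of
   SC_j * X_ij * V_i^2. The assignment x is stored row by row:
   x[i * n + j] is the number of cycles of task j at voltage level i. */
double dvfs_energy(size_t n, size_t L, const double *sc, const double *v,
                   const long *x)
{
    double e = 0.0;
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < L; ++i)
            e += sc[j] * (double)x[i * n + j] * v[i] * v[i];
    return e;
}

/* Returns 1 if x satisfies constraints (7.18) and (7.19), 0 otherwise. */
int dvfs_feasible(size_t n, size_t L, const long *x, const long *ec,
                  const double *f, double d)
{
    double total_time = 0.0;
    for (size_t j = 0; j < n; ++j) {
        long cycles = 0;
        for (size_t i = 0; i < L; ++i) {
            cycles += x[i * n + j];
            total_time += (double)x[i * n + j] / f[i];
        }
        if (cycles != ec[j])
            return 0;            /* (7.18) violated */
    }
    return total_time <= d;      /* (7.19) */
}
```

For a single task with SC = 1.6 nF (which corresponds to 40 nJ per cycle at 5.0 V), voltages of 5.0 V and 2.5 V, and 750 × 10^6 and 250 × 10^6 cycles at these voltages, `dvfs_energy` yields 32.5 J, and the assignment is feasible for d = 25 s.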

There are many cases in which tasks actually run faster than predicted by their 
worst case execution times. This cannot be exploited by the above algorithm. This 
limitation can be removed by using checkpoints at which actual and worst case 
execution times are compared and by then using this information to potentially scale 
down the voltage [30]. Also, voltage scaling in multi-rate task graphs has been proposed 
[479]. DVFS can be combined with other optimizations such as body biasing [369]. 
Body biasing is a technique for reducing leakage currents. 


7.4.2 Dynamic Power Management (DPM) 


In order to reduce the energy consumption, we can also take advantage of power- 
saving states, as introduced on p. 146. The essential question for exploiting DPM is: 
when should we go to a power-saving state? Straightforward approaches just use a 
simple timer to transition into a power-saving state. More sophisticated approaches 
model the idle times by stochastic processes and use these to predict the use of 
subsystems with more accuracy. Models based on exponential distributions have 
been shown to be inaccurate. Sufficiently accurate models include those based on 
renewal theory [490]. 

Comprehensive discussions of power management have been published (see, for 
example, [46, 356]). There are also advanced algorithms which integrate DVS and 
DPM into a single optimization approach for saving energy [491]. 

Allocating voltages and computing transition times for DPM may be two of the 
last steps of optimizing embedded software. 

Power management is also linked to thermal management. 


7.4.3 Thermal Management 


Design time planning of the thermal behavior would need to leave large margins in 
terms of available performance. Hence, it is necessary to use run-time monitoring of 
temperatures. This means that thermal sensors must be available in systems which 
potentially could get too hot. This information is then used to control the generation 
of additional heat and possibly has an impact on cooling mechanisms as well. Many 
users of mobile phones may already have observed this: it is, for example, very 
common to stop charging a mobile phone when it is already too hot. Controlling 
fans (when available) can be considered as another case of thermal management. 
Also, systems may be shutting down partially or completely, if temperatures are 
exceeding maximum thresholds. Shut-down areas of silicon chips can be called “dark 
silicon.” Some systems may be reducing the clock frequencies and voltages. There 
are also other options like a reduction of the performance by intentionally not using 
some of the available hardware. It is possible, for example, to issue fewer instructions 
per clock cycle or not to use some of the processor pipelines. For multiprocessor 
systems, tasks may be automatically migrated between various processors. In all of 
these cases, the objective “temperature” is evaluated at run-time and used to have an 
impact at run-time. Avoiding overheating is the goal of the work reported by Merkel 
et al. [391] and by Donald et al. [135]. Using temperature sensors to control the 
system means that control loops are being created. Potentially, such loops could start 
to oscillate. Atienza et al. compared the behavior of various control strategies 
and concluded that an advanced control loop algorithm provides the 
best results, with a higher computing performance at a lower temperature, compared 
to standard approaches [610]. The details of this control loop design would be 
beyond the scope of a textbook for undergraduate students. 


7.5 Problems 


We suggest solving the following problems either at home or during a flipped 
classroom session: 


7.1 Loop unrolling is one of the potentially useful optimizations. Please name two 
potential benefits and two potential problems! 


7.2 We assume that you want to use loop tiling. How can you adjust the tiling to 
the memory architecture at hand? 


7.3 For which architectures would you expect the largest benefits from a replace- 
ment of floating-point arithmetic by fixed-point arithmetic? 


7.4 Provide an overview over techniques for taking advantage of scratch pad 
memories! 


7.5 Consider the following program: 


 1 #include <stdio.h> 
 2 #define DATALEN 15 
 3 #define FILTERTAPS 5 
 4 double x[DATALEN] = { 128.0, 130.0, 180.0, 140.0, 120.0, 
 5                       110.0, 107.0, 103.5, 102.0, 90.0, 
 6                       84.0, 70.0, 30.0, 77.3, 95.7 }; 
 7 const double h[FILTERTAPS]={0.125,-0.25,0.5,-0.25,0.125}; 
 8 double y[DATALEN]; // result; 
 9 int main(void) { 
10   int i,n; 
11   for (i=0;i<DATALEN; ++i) { 
12     y[i] = 0; 
13     for(n=0; n < FILTERTAPS; ++n) 
14       if ((i-n) >= 0) y[i] += h[n]*x[i-n]; 
15   } 
16   for(i = 0; i < DATALEN; ++i) printf("%.2f ",y[i]); 
17   return 0; 
18 } 


Perform at least the following optimizations: 


• Removal of the if in the innermost loop (line 14) 

• Loop unrolling (line 13) 

• Constant propagation 

• Floating-point to fixed-point conversion 

• Avoidance of all accesses to arrays 


Please provide the optimized version of the program after each of the 
transformations and also check for consistent results! 


Table 7.2 SPM mapping: left, accesses to variables; right, memory characteristics 

Variable | Size [bytes] | Number of accesses     Memory      | Size [bytes]   | Energy per access 
a        | 1024         | 16                     Scratchpad  | 4096 (4k)      | 1.3 nJ 
b        | 2048         | 1024                   Main memory | 262,144 (256k) | 31 nJ 
c        | 512          | 2048 
d        | 256          | 512 
e        | 128          | 256 
f        | 1024         | 512 
g        | 512          | 64 
h        | 256          | 512 


7.6 Suppose that your computer is equipped with a main memory and a scratchpad 
memory. Sizes and the required energy per access are shown in Table 7.2 (right). 
Characteristics of accesses to variables are as indicated in Table 7.2 (left). 

Which of those variables should be allocated to the scratchpad memory, provided 
that we use a static, non-overlaying allocation of variables? Use the integer linear 
programming (ILP) model to select the variables. Your result should include the ILP 
model as well as the results. You may use the lp_solve program [17] to solve your 
ILP problem. 


Chapter 8 
Test 


Unfortunately, we cannot rely on designed and possibly already manufactured 
systems to operate as expected. These systems may have become defective during 
their use, or their function may have been compromised during the fabrication 
or their design. The purpose of testing is to verify whether or not an existing 
embedded/cyber-physical system can be operated as expected. In this chapter, we 
will present fundamental terms and techniques for testing. There will be a brief 
introduction to the aims of test pattern generation and their application. We will 
be introducing terms such as fault model, fault coverage, fault simulation, and 
fault injection. Also, we will be presenting techniques which improve testability, 
including the generation of pseudo-random patterns, and signature analysis. It would 
be beneficial to consider testability issues already during design. In case of fault- 
tolerant systems, resilience must be verified. 


8.1 Scope 


Testing can be done during or after the fabrication (manufacturing test) and also after 
the system has been delivered to the customer (field testing). Testing of embedded 
systems contained in a cyber-physical or IoT system needs special attention for 
several reasons: 


• Embedded systems integrated into a physical environment may be safety-critical. 
Therefore, their malfunctioning can be much more dangerous than, say, the 
malfunctioning of office equipment. As a result, expectations for the product 
quality are higher than for non-safety-critical systems. 

• Testing of timing-critical systems has to validate the correct timing behavior. This 
means that just testing the functional behavior is not sufficient. 

• Testing embedded/cyber-physical systems in their real environment may be 
dangerous. For example, testing control software in a nuclear power plant can 
be a source of serious, far-reaching problems. 


Fig. 8.1 Design flow with testing at its very end 


Preparations for testing should be done no later than at the end of the design 
phase. Preferably, necessary support for testing should even be considered earlier, 
intertwined with the design process and using testability as one of the objectives 
for evaluating designs. In order not to overload Chap. 5, we have moved all aspects 
of testing into this separate chapter. The presentation corresponds to considering 
testing only at the very end of the design flow (see Fig. 8.1), even though an 
earlier consideration during an actual design would be advisable. However, an early 
consideration is not always common practice, and therefore, Fig. 8.1 might also 
correspond to an actual design flow. 

In testing, we typically denote the system under design (SUD) as the device 
under test (DUT). We apply a set of specially selected input patterns, the 
so-called test patterns, to the input(s) of the DUT, observe its behavior, and compare 
the behavior with the expected behavior. Test patterns are normally applied to 
the real, already manufactured system. The main purpose of testing is to identify 
systems that have not been correctly manufactured (manufacturing test) and to 
identify systems that fail later (field test). Testing includes a number of different 
actions: 


1. test pattern generation, 

2. test pattern application, 

3. response observation, and 

4. result comparison. 


8.2 Test Procedures 


8.2.1 Test Pattern Generation for Gate-Level Models 


In test pattern generation, we try to identify a set of test patterns which distinguishes 
a correctly working from an incorrectly working system. Test pattern generation is 
usually based on fault models. Such fault models are models of possible faults. Test 
pattern generation tries to generate tests for all faults that are possible according to 
a certain fault model. 

The stuck-at-fault model is a frequently used fault model. It is based on the 
assumption that any internal wire of an electronic circuit is permanently connected 
to either '0' or '1'. It has been observed that many faults actually behave as if 
some wire was permanently connected that way. 


Example 8.1 As an example, consider the circuit shown in Fig. 8.2.^1 

Suppose that we would like to check if there is a stuck-at-1 fault for signal f. 
Toward this end, we try to set f to '0' by setting a = b = '0'. As a result, f should 
be '1' if there is this fault, and otherwise, it should be '0'. In order to observe this 
difference, we must propagate it to the output signal i. For this to happen, we must 
set e to '1' and set either c or d to '1'. h and i will be '1' if there is no fault and '0' 
otherwise. The test pattern comprises all values of inputs a to e. The D-algorithm 
can be used to generate this test pattern [318]. ∇ 
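
The reasoning of this example can be reproduced by simulating a circuit with and without the fault. The C sketch below uses a hypothetical gate structure which is merely consistent with the behavior described in the example; the actual gates of Fig. 8.2 may differ. The function names are our own, and all signal values are assumed to be 0 or 1.

```c
/* Hypothetical circuit consistent with the behavior described in Example 8.1
   (an assumption, not necessarily the circuit of Fig. 8.2):
     f = a AND b,  g = c OR d,  h = (NOT f) AND g,  i = h AND e.
   If f_stuck_at_1 is nonzero, signal f is forced to 1. Returns output i. */
int circuit_output(int a, int b, int c, int d, int e, int f_stuck_at_1)
{
    int f = f_stuck_at_1 ? 1 : (a & b);
    int g = c | d;
    int h = (!f) & g;
    return h & e;
}

/* A test pattern detects the fault if the fault-free and the faulty
   circuit produce different outputs. */
int detects_f_stuck_at_1(int a, int b, int c, int d, int e)
{
    return circuit_output(a, b, c, d, e, 0) != circuit_output(a, b, c, d, e, 1);
}
```

For the pattern a = b = 0, c = 1, d = 0, e = 1, the fault-free output is 1 and the faulty output is 0, so the fault is detected. With e = 0, the difference is not propagated to the output and the fault remains undetected.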


Many techniques for test pattern generation are based on the stuck-at-fault model. 
However, CMOS technologies require more comprehensive fault models. In CMOS 
technologies, faults can turn combinatorial devices into devices having internal 
states. This problem can occur if wires are broken (this case is known as stuck-at- 
open fault). As a result of this, gates of transistors can become disconnected. Such 
transistors will be conducting or nonconducting, depending on the charge stored on 
the gate before the wire was broken. In this way, the gate “remembers” the input 
signal due to stored charges. Furthermore, there may be transient faults and delay 
faults (faults changing the delay of a circuit). Delay faults may be the result of cross 
talk between adjacent wires. Fault models exist which take such hardware faults into 
account [311]. 

While good fault models exist for hardware testing, the same is not true for 
software testing. 


^1 Please remember: consistent with standard ANSI/IEEE 91, the symbols ≥1 and & denote OR- 
and AND-gates, respectively. 


8.2.2 Self-Test Programs 


One of the key problems of testing modern integrated circuits is their limited number 
of pins, making it more and more difficult to access internal components. Also, it 
is getting very difficult to test these circuits at full speed, since testers must be at 
least as fast as the circuits themselves. The fact that many embedded systems are 
based on processors provides a way out of this dilemma: processors are capable of 
running test programs or diagnostics. Such diagnostics have been used to test 
mainframe machines for decades. 


Example 8.2 Figure 8.3 shows components that might be contained in a processor. 
Testing for stuck-at-faults at the input of the ALU is feasible with a small test 
program: 


store pattern of all '1's in the register file; 
perform xor between constant "0000...00" and register; 
test if result contains a '0' bit; 
if yes, report error; 
otherwise start test for next fault; 

∇ 
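
The test program of Example 8.2 can be mimicked in C on a host machine. Since a fault-free host would never report an error, the sketch below includes a hypothetical fault model: `alu_xor` forces selected bits of its first operand to 0, imitating stuck-at-0 faults at an ALU input. All names are our own assumptions.

```c
#include <stdint.h>

/* Hypothetical model of an ALU xor whose first operand may have stuck-at-0
   input bits: bits set in stuck0_mask are forced to 0 before the operation. */
uint32_t alu_xor(uint32_t a, uint32_t b, uint32_t stuck0_mask)
{
    return (a & ~stuck0_mask) ^ b;
}

/* Self-test in the style of Example 8.2: xor a register holding all 1s with
   the constant 0 and report an error (return 1) if the result contains
   a 0 bit. Returns 0 if no fault was observed. */
int self_test_reports_error(uint32_t stuck0_mask)
{
    uint32_t reg = UINT32_MAX;                       /* pattern of all 1s */
    uint32_t result = alu_xor(reg, 0u, stuck0_mask);
    return result != UINT32_MAX;                     /* any 0 bit is an error */
}
```

A pattern of all 1s can only expose stuck-at-0 faults. Stuck-at-1 faults at the same input require a second run with a pattern of all 0s.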


Similar small programs can be generated for other faults. Unfortunately, the 
process of generating diagnostics for mainframes has mostly been a manual one. 
Some researchers have proposed to generate diagnostics automatically [48, 53, 64, 
308, 312, 313]. 


8.3 Evaluation of Test Pattern Sets and System Robustness 


8.3.1 Fault Coverage 


The quality of test pattern sets can be evaluated using fault coverage as a metric. 


Definition 8.1 Fault coverage is the percentage of potential faults that can be found 
for a given test pattern set: 

\[
\text{Coverage} = \frac{\text{Number of detectable faults for a given test pattern set}}{\text{Number of faults possible due to the fault model}}
\]


In practice, achieving a good product quality requires fault coverages in the area 
of at least 98-99%. The requirements may be higher for particular systems. Also, 
special fault models may be necessary for certain hardware components (e.g., for 
batteries). 

In addition to achieving a high coverage, we must also achieve a high correctness 
coverage. This means that a fault-free system must be recognized as such. Other- 
wise, it would be possible to achieve a 100% coverage by classifying all systems as 
faulty. Note the link to the metrics in Sect. 5.3.3. 

In order to increase the number of options that exist for system validation, it has 
been proposed to use test methods already during the design phase. For example, test 
pattern sets can be applied to software models of systems in order to check if two 
software models behave in the same way. More time-consuming formal methods 
need to be applied only to those cases in which the system passed this test-based 
equivalence check. 


8.3.2 Fault Simulation 


It is currently not feasible (and it will probably not be feasible) to completely predict 
the behavior of systems in the presence of faults or to analytically compute the 
coverage. Therefore, the behavior of systems in the presence of faults is frequently 
simulated. This type of simulation is called fault simulation. In fault simulation, 
system models are modified to reflect the behavior of the system in the presence of 
a certain fault. The goals of fault simulation include: 


• to know the effect of a fault of the components at the system level (i.e., to check 
whether faults are redundant) 

• to know whether or not mechanisms for improving fault tolerance actually help. 


Definition 8.2 Faults are called redundant if they do not affect the observable 
behavior of the system. 


Fault simulation requires the simulation of the system for all faults feasible for 
the fault model and also for a possibly large number of different input patterns. 
Accordingly, fault simulation is an extremely time-consuming process. Different 
techniques have been proposed to speed up fault simulation. 

One such technique applies to fault simulation at the gate level. In this case, 
internal signals are single-bit signals. This fact enables the mapping of a signal to 
a single bit of some machine word of a simulating host machine. AND- and OR- 
machine instructions can then be used to simulate Boolean networks. However, only 
a single bit would be used per machine word. Efficiency is improved with parallel 
fault simulation. 


Definition 8.3 Fault simulation is called parallel fault simulation if n > 1 
different test patterns are simulated at the same time, where n is the length of a 
bit vector supported as a machine data type of the simulating processor. 


The values of each of the n test patterns are mapped to a different bit position in the 
machine word. Executing the same set of AND- and OR-instructions will then simulate 
the behavior of the Boolean network for n test patterns instead of for just one. 
AVX instructions mentioned on p. 154 are very useful for this. 
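
The mapping of test patterns to bit positions can be illustrated with 64-bit words in C, i.e., n = 64. The Boolean network used below is a hypothetical example of our own and not taken from a figure of this book.

```c
#include <stdint.h>

/* Bit position k of each input word holds the value of that input in test
   pattern k, so a single call simulates up to 64 test patterns at once.
   Hypothetical network: out = (NOT (a AND b)) AND (c OR d) AND e. */
uint64_t simulate_parallel(uint64_t a, uint64_t b, uint64_t c,
                           uint64_t d, uint64_t e)
{
    uint64_t f = a & b;      /* AND gate, evaluated for all patterns */
    uint64_t g = c | d;      /* OR gate */
    uint64_t h = ~f & g;     /* inverted f combined with g */
    return h & e;            /* output, one bit per test pattern */
}
```

As a usage example, consider three patterns (a, b, c, d, e) = (0, 0, 1, 0, 1), (1, 1, 1, 0, 1), and (0, 1, 0, 1, 1) stored in bit positions 0, 1, and 2. The call `simulate_parallel(2, 6, 3, 4, 7)` returns 5, i.e., the output is 1 for the first and the third pattern and 0 for the second one.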


8.3.3 Fault Injection 


Fault simulation may be too time-consuming for real systems. If actual systems 
are available, fault injection can be used instead. In fault injection, real existing 
systems are modified, and the overall effect on the system behavior is checked. Fault 
injection does not rely on fault models (even though they can be used). Hence, fault 
injection has the potential of generating faults that would not have been predicted 
by a fault model. We can distinguish between two types of fault injection: 


• local faults within the system 

• faults in the environment (behaviors which do not correspond to the specification). 
For example, we can check how the system behaves if it is operated outside 
the specified temperature or radiation ranges. 


Several methods can be used for fault injection: 


• fault injection at the hardware level: examples include pin manipulation and 
electromagnetic and nuclear radiation 

• fault injection at the software level: examples include toggling some memory 
bits. 
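
Toggling a memory bit, the software-level method mentioned above, takes only a few lines of C. The function name and the bit numbering (bit 0 is the least significant bit of the first byte) are assumptions made for this sketch.

```c
#include <stddef.h>
#include <limits.h>

/* Toggle bit `bit_index` of the buffer `mem` in place. Bits are counted from
   the least significant bit of the first byte.
   Returns 0 on success and -1 if the index lies outside the buffer. */
int inject_bit_flip(unsigned char *mem, size_t size_bytes, size_t bit_index)
{
    if (bit_index / CHAR_BIT >= size_bytes)
        return -1;
    mem[bit_index / CHAR_BIT] ^= (unsigned char)(1u << (bit_index % CHAR_BIT));
    return 0;
}
```

In an actual experiment, such a function would be applied to the memory of the running system, and the resulting system behavior would be compared with the fault-free behavior.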


The quality of fault injection depends on the “probe effect”: probing might have an 
impact on the behavior of the system. This impact should be as small as possible 
and essentially be negligible. 

According to experiments reported by Kopetz [303], software-based fault injec- 
tion was essentially as effective as hardware-based fault injection. Nuclear radiation 
was a noticeable exception in that it generated errors which were not generated with 
other methods. 
8.4 Design for Testability


8.4.1 Motivation 


Ideas for test pattern generation for Boolean circuits have been presented in 
Subsection 8.2.1. For circuits implementing state machines (automata), test pattern 
generation is more difficult. Verifying whether or not two finite state machines are 
equivalent may require complex input sequences [301]. 


Example 8.3 The state chart of Fig. 2.25 is shown again in Fig. 8.4 for convenience. 
Suppose that we would like to test the transition from state C to state D. This 
requires us to get into state C first, by applying an appropriate sequence of input 
patterns. Assuming that we start from the default state, we have to generate a 
sequence comprising signals g and h. Next, we must generate input event i and 
check if output y is generated. Also, we need to check if we reached state D. We 
could apply input signal j and check if output z is emitted. Still, we would not 
be sure that we actually had reached state D. There could be a fault resulting in 
the generation of z for a transition from a different state. This procedure is rather 
complicated, takes a lot of time, and is susceptible to interference with other errors. 
Nevertheless, the procedure could even be more complicated since the overall test 
in our example is simplified by the fact that the FSM contains a linear chain of 
transitions (see the assignments of this chapter). ∇


This example demonstrates that if testing comes in only as an afterthought, it may be
very difficult to test a system. In order to simplify tests, special hardware can
be added such that testing becomes easier. The process of designing for better 
testability is called design for testability (DfT). Special purpose hardware for 
testing finite state machines is a prominent example of this. 


8.4.2 Scan Design 


Fig. 8.4 Finite state machine to be tested


Reaching certain states and observing states resulting from the application of input
patterns are very much simplified with scan design. In scan design, all flip-flops
storing states are connected to form serial shift registers (see Fig. 8.5). The circuit
contains three D-type flip-flops (DFF) and one multiplexer at each of the flip-flop
inputs. Using the control input of the multiplexers (shown at the bottom of the 
multiplexer inputs), either we can connect the flip-flops to the network generating 
the next state from the current state and the current input or we can connect flip-flops 
to form a serial chain. Setting the multiplexers to scan mode, we can load state bit 
after state bit into the scan chain (1 bit at every clock tick). This way, we can load 
any state into the three flip-flops serially. In a second phase, we can apply an input 
pattern to the FSM while the multiplexers are set to normal mode. After the next 
clock tick, the FSM will be in a new state. This new state can be serially shifted out 
in the third and final phase, using the serial mode again (1 bit per clock tick). The net 
effect is that we do not need to worry about how to get into certain states and how 
to observe whether or not the Boolean function δ for computing the next state has
been correctly implemented while we are generating tests for the FSM. Effectively, 
the fact that we are dealing with state-based systems has an impact only on the two 
(simple) shift phases, and test pattern generation for (stateless) Boolean networks 
can be used for checking for correct outputs. This means that it is sufficient to use 
test pattern generation methods for Boolean functions (stateless networks) instead 
of caring about complex input sequences, etc. 
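
The three phases (shift in, apply one pattern in normal mode, shift out) can be mimicked in software. The sketch below assumes a chain of three flip-flops and uses a hypothetical next-state function `delta` (a 3-bit counter that increments when the input is 1); neither is taken from Fig. 8.5.

```python
class ScanFSM:
    """Three flip-flops with a multiplexer in front of each flip-flop input."""

    def __init__(self, next_state):
        self.ff = [0, 0, 0]            # flip-flop contents, ff[0] is fed by scan-in
        self.next_state = next_state   # Boolean network computing the next state

    def clock(self, scan_mode, scan_in=0, inputs=0):
        """One clock tick. Returns the bit leaving the end of the chain."""
        scan_out = self.ff[-1]
        if scan_mode:                  # multiplexers select the serial chain
            self.ff = [scan_in] + self.ff[:-1]
        else:                          # multiplexers select the next-state network
            self.ff = list(self.next_state(tuple(self.ff), inputs))
        return scan_out

def delta(state, x):
    """Hypothetical next-state function: 3-bit counter, incremented if x == 1."""
    value = state[0] << 2 | state[1] << 1 | state[2]
    value = (value + x) % 8
    return (value >> 2 & 1, value >> 1 & 1, value & 1)

def scan_test(fsm, state, inputs):
    """Load a state, apply one input pattern, and read back the new state."""
    for bit in reversed(state):        # phase 1: shift in, 1 bit per clock tick
        fsm.clock(scan_mode=True, scan_in=bit)
    fsm.clock(scan_mode=False, inputs=inputs)                # phase 2: normal mode
    out = [fsm.clock(scan_mode=True) for _ in range(3)]      # phase 3: shift out
    return tuple(reversed(out))

result = scan_test(ScanFSM(delta), (1, 0, 1), 1)
```

Every state can be reached in three shift cycles regardless of the transition structure, so a test of `delta` reduces to a test of a stateless Boolean network.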

Scan design is a technique which works well for single chips. For board-level 
integration, it is necessary to have some technique for connecting scan chains of 
several chips. JTAG is a standard designed for this. The standard defines registers 
at the boundaries of all chips and a number of test pins and control commands such 
that all chips can be connected in scan chains. JTAG is also known as boundary scan 
[447]. 
Fig. 8.7 LFSR for response compaction: left, schematic; right, state diagram


8.4.3 Signature Analysis 


In order to also avoid shifting out the response of the device under test (DUT), 
responses can be compacted. A setup like the pipeline shown in Fig. 8.6 can be used 
for this purpose. Generated test patterns are used as inputs (or so-called stimuli) to 
the DUT. The response of the DUT is then compacted to form a signature, which
characterizes the response. This signature is later compared to the expected
signature, which can be computed by simulation.

The compaction is usually performed with linear feedback shift registers 
(LFSRs), shift registers with an XOR-feedback. 


Example 8.4 Figure 8.7 shows a 4-bit LFSR (left) and the associated state diagram 
(right) [318]. Blue dashed lines denote an input of '1'; red solid lines denote an
input of '0'. The selected feedback yields all possible signatures. During testing,
the response of the system tested is sent to the input of the LFSR. The LFSR will
then generate a signature reflecting the response. ∇


Due to storing the signature instead of the full response, several response patterns 
can be mapped to the same signature. What is the probability of obtaining a correct 
signature from an incorrect response? 

In general, an n-bit signature generator can generate $2^n$ signatures. For an m-bit
response of the DUT, the best that we can do is to evenly map $2^{(m-n)}$ responses to
the same signature. Suppose that we expect a certain signature to be generated for
the correct response of the system. Then, $2^{(m-n)} - 1$ incorrect responses would also
map to the same signature. There is a total of $2^m - 1$ incorrect responses if responses
are m bits long. Hence, the probability of an incorrect response mapping to the correct
signature (provided patterns map evenly to signatures) is
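$$
\begin{aligned}
P &= \frac{\text{other patterns mapping to the same signature}}{\text{total number of other patterns}} && (8.1) \\
  &= \frac{2^{(m-n)} - 1}{2^m - 1} && (8.2) \\
  &\approx \frac{2^{(m-n)}}{2^m} \quad \text{for } m \gg n && (8.3) \\
  &\approx \frac{1}{2^n} \quad \text{for } m \gg n && (8.4)
\end{aligned}
$$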


This means that the probability of generating correct signatures from an incorrect 
test response is very small if the shift register is long. For example, actual shift 
registers may be 32 bits long. Nevertheless, it is still possible to obtain the correct
signature for an incorrect response. The corresponding effect is called aliasing. A careful
analysis of aliasing is recommended at least for critical applications.
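
Aliasing can be checked exhaustively for small sizes. The following sketch assumes a serial-input signature register with n = 4 and the feedback polynomial x^4 + x + 1; this polynomial and the 8-bit "correct" response are assumptions made for illustration and are not necessarily those of Fig. 8.7.

```python
from itertools import product

N = 4                 # signature length n
POLY_LOW = 0b0011     # assumed feedback polynomial x^4 + x + 1 (terms below x^4)

def signature(bits, n=N, poly_low=POLY_LOW):
    """Compact a serial bit stream into an n-bit signature."""
    state = 0
    mask = (1 << n) - 1
    for bit in bits:
        msb = state >> (n - 1) & 1           # bit that is fed back
        state = ((state << 1) | bit) & mask  # shift in the next response bit
        if msb:
            state ^= poly_low                # XOR feedback
    return state

def count_aliases(correct):
    """Count incorrect responses that yield the signature of the correct one."""
    expected = signature(correct)
    return sum(1 for r in product((0, 1), repeat=len(correct))
               if r != tuple(correct) and signature(r) == expected)

correct = (1, 0, 1, 1, 0, 0, 1, 0)   # hypothetical fault-free response, m = 8
m, n = len(correct), N
aliases = count_aliases(correct)
p_exact = aliases / (2**m - 1)       # corresponds to Eq. (8.2)
p_approx = 2**-n                     # corresponds to Eq. (8.4)
```

For this register, the responses map evenly to the signatures, so the count equals $2^{(m-n)} - 1 = 15$. The exact probability 15/255 is slightly below the approximation 1/16.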


8.4.4 Pseudo-random Test Pattern Generation 


For chips with a large number of flip-flops, it can take quite some time to shift in 
the test patterns. In order to speed up the process of generating patterns on the chip, 
it has been proposed to also integrate hardware for generating test patterns on the 
chip. This is especially useful when the bandwidth for accesses from outside the 
chip is much less than the internal bandwidth on the chip. 

For example, pseudo-random patterns (also generated by LFSRs) can be used as 
test patterns. This method typically requires less chip space than patterns stored in 
a table. 


Fig. 8.8 Linear feedback shift register for test pattern generation


Example 8.5 We can modify the circuit of Fig. 8.7 as shown in Fig. 8.8. The circuit
generates all possible test patterns, except the pattern consisting of all zeros.
Patterns consisting of all zeros have to be avoided, since the generator would get
stuck once it arrives at such a pattern. The generated patterns typically exercise
the system under test much better than the patterns produced by simple counters.
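
The behavior of such a generator can be reproduced with a few lines of code. The sketch assumes a 4-bit register with the feedback polynomial x^4 + x + 1, which is one of the polynomials yielding a maximal-length sequence; the feedback actually used in Fig. 8.8 may differ.

```python
N = 4
MASK = (1 << N) - 1
POLY_LOW = 0b0011     # assumed feedback polynomial x^4 + x + 1 (terms below x^4)

def step(state):
    """Advance the autonomous LFSR by one clock tick."""
    msb = state >> (N - 1) & 1
    state = (state << 1) & MASK
    if msb:
        state ^= POLY_LOW
    return state

def generate(seed, count):
    """Return `count` successive patterns, starting with the seed."""
    patterns = []
    state = seed
    for _ in range(count):
        patterns.append(state)
        state = step(state)
    return patterns

patterns = generate(seed=0b0001, count=15)
```

Starting from any non-zero seed, the register visits all 15 non-zero patterns before the sequence repeats, whereas `step(0)` returns 0, which is the stuck state mentioned above.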


8.5 Problems 


8.1 Consider the circuit shown in Fig. 8.2. Generate a test pattern for a stuck-at-0
fault at signal h!


8.2 Which state diagram corresponds to the LFSR shown in Fig. 8.9? 


8.3 Specify test patterns and expected responses for the FSM shown in Fig. 8.4. 
These patterns must be specified as a sequence of pairs (test pattern, expected 
response). Events shown in Fig. 8.4 can be used as test patterns. We assume that 
the FSM will be in the default state after power on. Provide a complete test for all 
transitions! Note that the special chain-like structure of the FSM simplifies testing. 
Appendix A
Integer Linear Programming 


We assume that not all readers of this book are familiar with all the prerequisites 
required for understanding all previous chapters. We use appendices to commu- 
nicate some of the knowledge which is possibly missing. Three topic areas are 
covered in the appendices. In the current appendix, we present integer linear
programming. Integer linear programming (ILP) is a mathematical optimization 
technique applicable to a large number of optimization problems. 

ILP models provide a general approach for modeling optimization problems. ILP 
models consist of two parts: a cost function and a set of constraints. Both parts 
involve references to a set $X = \{x_i\}$ of integer-valued variables. Cost functions
must be linear functions of those variables. So, they must be of the general form


$$C = \sum_{i} a_i x_i, \quad \text{with } a_i \in \mathbb{R},\ x_i \in \mathbb{N}_0 \tag{A.1}$$


The set J of constraints must also consist of linear functions of integer-valued
variables. They must be of the form


$$\forall j \in J: \sum_{i} b_{i,j} x_i \geq c_j \quad \text{with } b_{i,j}, c_j \in \mathbb{R} \tag{A.2}$$


Definition A.1 The integer linear programming (ILP) problem is the problem 
of minimizing the cost function of Eq.(A.1) subject to the constraints given in 
Eq. (A.2). If all variables are constrained to being either 0 or 1, the corresponding 
model is called a 0/1-integer linear programming model. In this case, variables 
are also denoted as (binary) decision variables. 


Note that ≥ can be replaced by ≤ in Eq. (A.2) if constants $b_{i,j}$ are modified
accordingly. Also, the case of negative variables $x_i$ (i.e., allowing $x_i$ to have any
integer value) can be transformed into the case of non-negative variables shown
above by multiplying constants by −1. Applications requiring maximizing some
gain function C′ can be changed into the above form by setting C = −C′. Equations
may be represented by pairs of constraints, but they are typically used to eliminate
some variables.


Table A.1 Possible solutions of the presented ILP problem

x1  x2  x3   C
0   1   1   10
1   0   1    9
1   1   0   11
1   1   1   15


Example A.1 Assuming that x1, x2, and x3 must be integers, the following set of
equations represents a 0/1-ILP model:


$$
\begin{aligned}
C &= 5 x_1 + 6 x_2 + 4 x_3 && (A.3) \\
x_1 + x_2 + x_3 &\geq 2 && (A.4) \\
x_1 &\leq 1 && (A.5) \\
x_2 &\leq 1 && (A.6) \\
x_3 &\leq 1 && (A.7)
\end{aligned}
$$


Due to the constraints, all variables are either 0 or 1. There are four possible
solutions. These are listed in Table A.1. The solution with a cost of 9 is optimal. ∇
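
For a model of this size, the solutions can be found by plain enumeration. The following sketch is not an ILP solver; it only checks the cost function (A.3) and the constraint (A.4) for all assignments of 0 and 1 to the three variables, which already satisfies (A.5) to (A.7).

```python
from itertools import product

def cost(x1, x2, x3):
    """Cost function of Eq. (A.3)."""
    return 5 * x1 + 6 * x2 + 4 * x3

def feasible(x1, x2, x3):
    """Constraint of Eq. (A.4); Eqs. (A.5)-(A.7) hold by construction."""
    return x1 + x2 + x3 >= 2

# Enumerate all 0/1 assignments and keep the feasible ones with their cost.
solutions = [(x, cost(*x)) for x in product((0, 1), repeat=3) if feasible(*x)]
best_assignment, best_cost = min(solutions, key=lambda s: s[1])
```

Enumeration requires $2^k$ evaluations for k binary variables and is therefore limited to very small models; ILP solvers prune this search space.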


ILP is a variant of linear programming (LP). For linear programming, variables can
take any real values. ILP and LP models can be solved optimally using mathematical
programming techniques. Unfortunately, ILP is NP-complete, whereas LP can be
solved in polynomial time, and ILP execution times may become very large.

Nevertheless, ILP models are useful for modeling optimization problems as 
long as the model sizes are not extremely large. Modeling optimization problems 
as integer linear programming problems makes sense despite the complexity of 
the problem: many problems can be solved in acceptable execution times, and if 
they cannot, ILP models provide a good starting point for heuristics. Execution 
times depend on the number of variables and on the number and structure of the 
constraints. Good ILP solvers (like lp_solve [17] or CPLEX) can solve well-structured
problems containing a few thousand variables in acceptable computation times (e.g., 
minutes). For more information on ILP and LP, refer to books on the topic (e.g., to 
Wolsey [594]). 


Appendix B 
Kirchhoff’s Laws and Operational 
Amplifiers 


Our presentation of D/A-converters on p. 180 assumes some basic knowledge 
about operational amplifiers. This knowledge is frequently lacking among computer 
science students, and therefore the necessary fundamentals are presented in this 
appendix. These fundamentals require an understanding of Kirchhoff’s laws, of 
which students will also be reminded in this Appendix. 


B.1 Kirchhoff’s Laws 


Kirchhoff’s laws provide a means for analyzing electrical circuits. The first rule is 
Kirchhoff’s Current Law, also called Kirchhoff’s Junction Rule, or Kirchhoff’s First 
Law. The rule applies to junctions such as the one shown in Fig. B.1. 


Theorem B.1 (Kirchhoff’s Current Law) At any point in an electrical circuit, 
the sum of currents flowing toward that point is equal to the sum of currents flowing 
away from that point [273]. Formally, for any node in a circuit, we have: 


$$\sum_{k} i_k = 0 \tag{B.1}$$


If Kirchhoff’s law is used in the form of Eq.(B.1), currents denoted by arrows 
pointing away from the node must be counted as negative, and this counting is 
independent of the direction into which electrons are actually flowing. 


Example B.1 For the currents of Fig. B.1, we have 


$$
\begin{aligned}
i_1 + i_2 - i_3 + i_4 &= 0 && (B.2) \\
i_1 + i_2 + i_4 &= i_3 && (B.3)
\end{aligned}
$$


This invariance exists due to the conservation of electrical charge. Without this
rule, the total electrical charge would not remain constant, and the voltage would
increase.

Kirchhoff’s second rule applies to loops in a circuit. It is known as Kirchhoff’s 
Voltage Law, Kirchhoff’s loop rule, or Kirchhoff’s Second Law. Figure B.2 shows 
an example. 


Theorem B.2 (Kirchhoff’s Voltage Law) The sum of the potential differences 
(voltages) across all elements around any closed circuit must be zero [273]. 
Formally, for any loop in a circuit, we have: 


$$\sum_{k} V_k = 0 \tag{B.4}$$


If we traverse voltages against the arrow direction, we have to count them as 
negative. 


Example B.2 For the schematic of Fig. B.2, we have 

$$V_1 - V_2 - V_3 + V_4 = 0 \tag{B.5}$$


The underlying reason for this invariance is the conservation of energy. Without 
this rule, we could accelerate charge in the loop, and the charge would accumulate 
energy without any energy consumption elsewhere. 

In general, it is not relevant into which direction electrons are actually flowing 
and which of two terminals is actually positive with respect to some other terminal. 
Arrows can be selected in an arbitrary way. We just have to make sure that we respect 
the direction of the arrows when we apply Kirchhoff’s laws. If arrows for voltages 
and currents across components are pointing in opposite directions, the equation for 
that component has to take that into account. 
Example B.3 Ohm’s law for resistor R3 in Fig. B.2 reads as follows, due to the
opposite directions of voltage and current arrows:


$$I_3 = -\frac{V_3}{R_3} \tag{B.6}$$


Of course, we will typically try to define the direction of voltages and currents such 
that we avoid having too many minus signs. 


B.2 Operational Amplifiers 


In electronics, there is frequently the need to amplify some signal x(t) in order to 
obtain some amplified signal y(t) = a · x(t), with a > 1. The factor a is called the gain.
Designing different circuits for each and every gain would be a laborious task. 
Therefore, designers are frequently using a general amplifier which can be easily 
configured to have the required gain. Such a general amplifier is called an operational
amplifier, or op-amp for short. Op-amps are designed for a very large maximum
gain. The required actual gain can be adjusted with a proper selection of a few 
hardware components in the circuit surrounding the op-amp. 

More precisely, an operational amplifier is a component having two signal inputs 
and one signal output. In addition, there are at least two power supply inputs (see 
Fig. B.3). 

Op-amps amplify the difference between the voltages at the two signal inputs 
with respect to ground by a gain g: 


$$V_{out} = g \cdot (V_+ - V_-) \tag{B.7}$$


g is called the open loop gain and is typically very large (e.g., $10^4 < g < 10^6$). For
an ideal op-amp, g would approach infinity. Furthermore, op-amps usually come
with a very high input impedance (> 1 MΩ). Hence, we can frequently ignore signal
input currents. For an ideal op-amp, the input impedance would be infinity and input 
currents would be zero. 

Op-amps have been commercially available for decades, both as separate inte- 
grated circuits and within other circuits. They differ by their speed, their voltage 
ranges, their current drive capability, and other characteristics. The actual gain of 
the circuit is selected with external resistors. Figure B.4 shows how this can be
done. 

Any small voltage between the two signal inputs is amplified by a large factor. 
Via resistor R1, the resulting output voltage is fed back. Feedback is to the inverting
input, and therefore, any positive voltage V_ results in a negative voltage Vout and 
vice versa. This means that the feedback will work against the input voltage and it 
does so very strongly, due to the large amplification. Therefore, the feedback will 
reduce the voltage at the input pin. The question is: by how much? We can use 
Kirchhoff’s rules to find the resulting voltage V_ (see Fig. B.5). 

Due to the characteristics of op-amps, we have 


$$V_{out} = -g \cdot V_- \tag{B.8}$$

Due to Kirchhoff’s law for the loop shown by a dashed line in Fig. B.5, we have

$$I \cdot R_1 + V_{out} - V_- = 0 \tag{B.9}$$


Note that we include a minus sign for V_ since we are traversing a segment of the 
loop against the direction of the arrow. From Eqs. (B.8) and (B.9), we get 


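$$
\begin{aligned}
I \cdot R_1 + (-g) \cdot V_- - V_- &= 0 && (B.10) \\
(1 + g) \cdot V_- &= I \cdot R_1 && (B.11) \\
V_- &= \frac{I \cdot R_1}{1 + g} && (B.12)
\end{aligned}
$$


Hence, we have


$$
\begin{aligned}
V_{-,ideal} &= \lim_{g \to \infty} \frac{I \cdot R_1}{1 + g} && (B.13) \\
            &= 0 && (B.14)
\end{aligned}
$$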

This means that, for an ideal op-amp, V_ is 0. Due to this, the inverting signal input 
is called virtual ground. Nevertheless, this input cannot be connected to ground, 
since this would change the currents. 
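
The effect of the open loop gain on the voltage at the inverting input can be checked numerically. The sketch below evaluates Eq. (B.12) for a few gains; the current of 1 mA and the resistance of 10 kΩ are assumed values chosen for illustration.

```python
def v_minus(current, r1, gain):
    """Voltage at the inverting input according to Eq. (B.12)."""
    return current * r1 / (1.0 + gain)

def loop_residual(current, r1, gain):
    """Left-hand side of Eq. (B.9); expected to be zero up to rounding errors."""
    v_m = v_minus(current, r1, gain)
    v_out = -gain * v_m                  # Eq. (B.8)
    return current * r1 + v_out - v_m

I, R1 = 1e-3, 10e3                       # assumed values: 1 mA and 10 kOhm
table = [(g, v_minus(I, R1, g)) for g in (1e2, 1e4, 1e6)]
```

With these values, the voltage at the inverting input drops from about 0.1 V for g = 100 to about 10 µV for g = 10^6, which illustrates why this input is treated as virtual ground.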

Computing the actual gain of the circuit in Fig. B.4 is left as an exercise for 
Chap. 3. 


Appendix C 
Paging and Memory Management Units 


In this Appendix, we discuss a basic technique for managing memories.
In simple systems, physical memories are actually addressed by the addresses 
which are seen by (assembly language) programmers. This approach is very easy 
to implement from a hardware technology point of view. However, this approach 
has disadvantages for using the memory. For example, the allocation of objects to 
memory is very static. The size of memory objects needs to be estimated before 
actually allocating memory. 

More flexibility is obtained when we distinguish between the memory addresses 
as seen by the (assembly language) programmer and the ones used to address phys- 
ical memory. Addresses as seen by the programmer are called virtual addresses, 
and addresses seen at the memory are called real or physical addresses. 

For a memory organization called paging, we partition the space of virtual 
addresses into chunks of equal size, called pages. The size of these pages is a power 
of two, such as 2k bytes or 4k bytes. As a result, virtual addresses consist of those
bits addressing a particular page and those addressing a word or byte within a page. 
The first set of bits is called page number, the second the offset. 

Physical memory is partitioned into page frames of the same size. Then, a 
mapping table—called page table—contains the information needed to map page 
numbers to the corresponding start address in physical memory. The offset is 
identical for virtual and real addresses (see Fig. C.1 (left)). This allows for a more 
dynamic allocation to memory. Contiguous ranges of virtual addresses do not need 
to be allocated to contiguous ranges in real memory, offering much more allocation 
freedom (see Fig. C.1 (right)). Certain memory objects (e.g., stacks) can grow
and shrink in multiples of page sizes. 
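
The translation of a virtual address into a real address can be sketched as follows. The page size of 4k bytes matches the sizes mentioned above, whereas the contents of `page_table` are hypothetical.

```python
PAGE_SIZE = 4096                             # 4k bytes, a power of two
OFFSET_BITS = PAGE_SIZE.bit_length() - 1     # 12 offset bits for 4k pages

# Hypothetical page table: page number -> page frame number.
page_table = {0: 5, 1: 2, 2: 7}

def translate(vaddr):
    """Map a virtual address to a real address using the page table."""
    page = vaddr >> OFFSET_BITS              # upper bits: page number
    offset = vaddr & (PAGE_SIZE - 1)         # lower bits: offset within the page
    if page not in page_table:
        raise LookupError(f"no mapping for page {page}")
    frame = page_table[page]
    return frame << OFFSET_BITS | offset     # offset is copied unchanged

real = translate(0x1234)
```

Virtual pages 0, 1, and 2 are contiguous, but they are mapped to frames 5, 2, and 7, which are not contiguous in real memory.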

There may be more than one virtual address space, such as one address space 
per process managed by the operating system. In this case, the relevant page table 
has to be set during context switches. The actual mapping from virtual to real 
addresses is typically performed in a memory management unit (MMU). The 
MMU is placed between processors and the memory and converts virtual addresses
into real addresses. During its operation, the MMU needs to know the contents


Fig. C.1 Paging: left, address translation; right, impact on mapping of address spaces
of the page table. Page tables may be large, and, hence, fast buffers can be used 
as caches specialized for accesses to the page table. These fast buffers are called 
translation look-aside buffers (TLB) or address translation memories (ATM). 
TLBs are assumed to contain copies of frequently used correspondences between 
virtual and real addresses. 
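
The role of a TLB as a cache for page table entries can be illustrated with a small model. The capacity of two entries, the least-recently-used replacement, and the page table contents are assumptions of this sketch; actual TLBs are hardware structures with various replacement policies.

```python
from collections import OrderedDict

class TLB:
    """Small cache of page-number-to-frame correspondences."""

    def __init__(self, page_table, capacity=2):
        self.page_table = page_table     # full table, slow to access
        self.capacity = capacity
        self.entries = OrderedDict()     # cached entries, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        """Return the frame for a page, consulting the page table on a miss."""
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)           # mark as recently used
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)     # evict least recently used entry
            self.entries[page] = self.page_table[page]
        return self.entries[page]

tlb = TLB({0: 5, 1: 2, 2: 7})                        # hypothetical page table
frames = [tlb.lookup(p) for p in (0, 1, 0, 2, 1)]
```

In this access sequence, only the second access to page 0 is served from the TLB; the other four accesses require a look-up in the page table.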

For PCs, the presented approach is frequently combined with demand paging, 
i.e., fetching page frames currently not in main memory from a slower background 
memory on demand. Demand paging is less popular in embedded systems, fre- 
quently due to the non-availability of a background memory. The term “paging” 
is often used for what we have called demand paging. However, the distinction
between paging as a method for mapping virtual addresses to real addresses and
paging as a method for automatically fetching information from a background memory
is important for embedded systems.

In addition to the reasons mentioned above, paging is also useful for memory pro- 
tection. Page table entries commonly contain bits which indicate access permissions 
to the memory represented by this entry. Common permission bits include read, 
write, and execute permissions. These protection bits enable the system designer to 
isolate memory spaces of different tasks or processes against each other and also 
help to protect the operating system from erroneous memory accesses by tasks or 
even rogue processes which try to subvert a system’s security. The latter aspect is 
gaining relevance especially in the context of networked embedded systems, as used 
in Internet of Things applications. 
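
Permission checks based on page table entries can be sketched as follows. The encoding of the permission bits and the table contents are hypothetical; real MMUs define their own entry formats.

```python
from enum import IntFlag

class Perm(IntFlag):
    """Assumed encoding of the permission bits of a page table entry."""
    R = 4
    W = 2
    X = 1

# Hypothetical page table: page number -> (frame number, permissions).
page_table = {
    0: (5, Perm.R | Perm.X),   # code page: read and execute
    1: (2, Perm.R | Perm.W),   # data page: read and write
}

def access(page, needed):
    """Return the frame if all needed permissions are granted for the page."""
    frame, perms = page_table[page]
    if (perms & needed) != needed:
        raise PermissionError(f"page {page}: access not permitted")
    return frame

frame = access(0, Perm.R)      # reading the code page is allowed
```

An attempt to write to the code page, e.g., `access(0, Perm.W)`, raises `PermissionError` in this sketch; an actual MMU would raise an exception to be handled by the operating system.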

Please refer to books on computer architecture for more information on memory 
management [211]. 